what next after pyparsing?

what next after pyparsing? - python

I have a huge grammar developed for pyparsing as part of a large, pure Python application.
I have reached the limit of performance tweaking and I'm at the point where the diminishing returns make me start to look elsewhere. Yes, I think I know most of the tips and tricks and I've profiled my grammar and my application to dust.
What next?
I hope to find a parser that gives me the same readability, usability (I'm using many advanced features of pyparsing such as parse-actions to start the post processing of the input which is being parsed) and python integration but at 10× the performance.
I love the fact the the grammar is pure Python.
All my basic blocks are regular expressions, so reusing them would be nice.
I know I can't have everything so I am willing to give up on some of the features I have today to get to the requested 10× performance.
Where do I go from here?

It looks like the pyparsing folks have anticipated your problem. From https://github.com/pyparsing/pyparsing/blob/master/docs/HowToUsePyparsing.rst :
Performance of pyparsing may be slow for complex grammars and/or large input strings. The psyco package can be used to improve the speed of the pyparsing module with no changes to grammar or program logic - observed improvments have been in the 20-50% range.
However, as Vangel noted in the comments below, psyco is an obsolete project as of March 2012. Its successor is the PyPy project, which starts from the same basic approach to performance: use a JIT native-code compiler instead of a bytecode interpreter. You should be able to achieve similar or greater gains with PyPy if switching Python implementations will work for you.
If you're really a speed demon, but want to keep some of the legibility and declarative syntax, I'd suggest having a look at ANTLR. Probably not the Python-generating backend; I'm skeptical whether that's mature or high-performance enough for your needs. I'm talking about the goods: the C backend that started it all.
Wrap a Python C extension module around the entry point to the parser, and turn it loose.
Having said that, you'll be giving up a lot in this transition: basically any Python you want to do in your parser will have to be done through the C API (not altogether pretty). Also, you'll have to get used to very different ways of doing things. ANTLR has its charms, but it's not based on combinators, so there's not the easy and fluid relationship between your grammar and your language that there is in pyparsing. Plus, it's its own DSL, much like lex/yacc, which can present a learning curve – but, because it's LL based, you'll probably find it easier to adapt to your needs.

Switch to a generated C/C++ parser (using ANTLR, flex/bison, etc.). If you can delay all the action rules until after you are done parsing, you might be able to build an AST with trivial code and then pass that back to your python code via something like SWIG and process on it with your current actions rules. OTOH, for that to give you a speed boost, the parsing has to be the heavy lifting. If your action rules are the big cost, then this will buy you nothing unless you write your action rules in C as well (but you might have to do it anyway to avoid paying for whatever impedance mismatch you get between the python and C code).

If you really want performance for large grammars, look no farther than SimpleParse (which itself relies on mxTextTools, a C extension). However, know now that it comes at the cost of being more cryptic and requiring that you be well-versed in EBNF.
It's definitely not the more Pythonic route, and you're going to have to start all over with an EBNF grammar to use SimpleParse.

A bit late to the party, but PLY (Python Lex-Yacc), has served me very well. PLY gives you a pure Python framework for constructing lex-based tokenizers, and yacc-based LR parsers.
I went this route when I hit performance issues with pyparsing.
Here is a somewhat old but still interesting article on Python parsing which includes benchmarks for ANTLR, PLY and pyparsing. PLY is roughly 4 times faster than pyparsing in this test.

There's no way to know what kind of benefit you'll get without just testing it, but it's within the range of possibility that you could get 10x benefit just from using Unladen Swallow if your process is long-running and repetitive. (Also, if you have many things to parse and you typically start a new interpreter for each one, Unladen Swallow gets faster - to a point - the longer you run your process, so while parsing one input might not show much gain, you might get significant gains on the 2nd and 3rd inputs in the same process).
(Note: pull the latest out of SVN - you'll get far better performance than the latest tarball)

Related

Reverse Engineer Auto-Generated C?

How easy is it to reverse engineer an auto-generated C code? I am working on a Python project and as part of my work, am using Cython to compile the code for speedup purposes.
This does help in terms of speed, yet, I am concerned that where I work, some people would try to "peek" into the code and figure out what it does.
Cython code is basically an auto-generated C. Is it very hard to reverse engineer it?
Are there any recommendations that would make the code safer and reverse-engineering harder to do? (I assume that with enough effort, everything can be reversed engineered).

Okay -- to attempt to answer your question more directly: most auto-generated C code is fairly ugly, so somebody would need to be fairly motivated to reverse engineer it. At the same time, I don't believe I've never looked at what Cython generates, so I'm not sure how it looks.
In addition, a lot of auto-generated code is done in the form of things like state machine tables, that most programmers find fairly difficult to follow even at best. The tendency (in many cases) is to have a generic framework, with tables of data that the framework more or less "interprets" at run-time. This isn't necessarily impossible to follow, but it's enough different from most typical code that most people will give up on it fairly quickly (and if they do much, they'll typically waste a lot of time looking at the framework instead of the data, which is what really matters in cases like this).
I'll repeat, however, that I'm pretty sure I haven't looked at what Cython produces, so I can't say much about it with any real certainty.
There are (or at least used to be) commercial obfuscators intended to make C source code difficult to understand. I suspect the availability of Perl has taken a lot of the market share from them, but if you look you may still be able to find one and use it.
Absent that, it's not terribly difficult to write an obfuscator of your own, but the degree of effectiveness will probably vary with the amount of effort you're willing to put into it. Just systematically renaming any meaningful variable names into things like _ and __ can do quite a bit (e.g., profit = sales - costs; is a lot more meaningful than _ = _I_ - _i_;). Depending on the machine generated code in question, however, this may not really accomplish much -- obfuscating a generic framework may not make much difference in understanding what your code does -- and if they figure out the procedure you're following, they may be able to simply replicate the correct framework code and transplant the pieces specific to your program into the un-obfuscated framework.

You should really take a look at the code that Cython produces. To help with debugging, for example, it copies the complete Python source code into the generated file, marking each source line before generating C code for it. That makes it very easy to find the code section that you are interested in.
A very nice feature is that you can compile your code with the "-a" (annotate) option, and it will spit out an HTML file next to the C file that contains the annotated Python code. When you click on a line, you'll see the C code for that line. As a bonus, it marks lines that do a lot of Python processing in dark yellow, so that you get a simple indicator where to look for potential optimisations.
There's also special gdb support in Cython now, so you can do Cython source level debugging etc.

Ah, I guess I missed the bit that you were talking about the compiled module, whereas I was only referring to the source code that Cython generates. I agree with Jerry that it will be fairly tricky to extract something useful from the compiled module, as long as you keep the gdb support disabled (the default) and strip the debugging symbols. That is because the C compiler will do lots of inlining of helper functions all over the place and apply various low-level code optimisations, thus making it harder to extract the original macro level code patterns. However, you will see named C-API calls to CPython, and you will also see function names from your own code. Cython isn't specifically designed for code obfuscation, quite the opposite. But readable assembly has certainly never been a design goal.

Python vs C : Line of Code Comparison vs Dev Time

Hi I'm currently learning Python since the syntax feels so succinct and the idioms match well with my mental model.
However I'm also interested in learning about OS internals and reverse engineering software, which ultimately means knowing C in a rather thorough capacity.
When originally picking a language I did lots of reading and comparisons, and it seems that a number thrown out a lot is that to write short idiomatic statements in Python would require the equivalent of a few hundred lines of C (I'd guess code for memory management, writing the code for dictionaries,lists etc) that we take for granted as built into the Python language.
1) With an average C programmer, is that 100-200 lines of code per Python idiom anywhere near accurate?
Because C doesn't come built-in with Python-like constructs such as dictionaries/lists(with all their nice methods etc):
2) Do C programmers tend to build these constructs from scratch and then re-use them between projects to greatly reduce the actual amount of hand coding for their projects?
I assume re-using libraries like boost:: stuff also again, reduces some of the boilerplate hand coding also...
3) But does using popular libraries and re-using common code one has written before in C for basic constructs/etc, how much does that revise the lines of code written in C compared to the code in Python of a enthusiast sized code base?
I know specific numbers aren't possible, but is it possible with libraries, code reuse etc, to have a development time in C close to that of Python without being a Linus Torvalds style coding machine?
Thanks!

but is it possible with libraries, code reuse etc, to have a development time in C close to that of Python
No.
You've missed the most important point.
Python's interactive. It's not edit-compile-link-execute-break-debug. It's edit-debug.

Boost is C++, not C (emphatically not C -- virtually all of it makes heavy use of templates and such that aren't part of C).
Yes, C programmers tend to build up personal libraries of code for all sorts of "stuff" -- data structures, algorithms, user interfaces, and so on. There are also a fair number of other libraries for everything from basic string manipulation to database connectivity, user interfaces, basic algorithms and data structures, etc.
Comparing productivity between the two can be difficult though -- even if something can be done in one line of code either way, there's a greater chance that the C programmer will end up doing extra work to find and learn to use that particular library. OTOH, if he has used it before, the two might be directly competitive of (in a few cases) C might be more productive.
I'd guess Python ends up more productive more often, but trying to guess how much so is difficult (and lines of code usually won't be a good indication either).

As I did serious c programming I read a book that claimed libraries are worth to write. (Especially in C which considered a low level language)
Libraries are build for reuse.
If you use libraries you write one line like detectFace( faceDesriptor ) or renderPDF( document) is doesn't matter whether an idiom in another language is more concise or not.
Lines of code isn't a proper metric if it is about what would more efficient.

It depends.
Try to write an interrupt handler in python. Someone could probably make it work but it's going to be a dancing bear, the dancing is not good but it is surprising that a bear can do it. Want to write an OS or do some embedded programming you're not going to be able to use python. It's telling that the main python implementation is written in C.
That being said I'm amazed at some of the low-level stuff that you can do with python. The high-level stuff is almost a given if you're measuring lines of code. Python is just a higher-level language.
They are both very useful tools, just for different types of projects. Knowing both would be very useful, particularly when you need to interface to some new functionality in python that doesn't yet have a python binding.
For the types of projects most developers work on python is going to be more consice and quicker to write and debug. You may be able to make a library of reusable C code, but a good python programmer will be doing the same thing with their python code, at a higher level.

I think Python is more productive for small projects (up to a few thousand lines of code).
On the other hand, C is better suited for large projects (even though IMHO there are better languages for that, such as Ada): static type-checking allows to find many errors at compile time that are much more difficult to detect at run-time, especially in a large program.
In a larger C project, the lack of lists and other powerful data structures that are found in Python can be compensated by implementing or using custom libraries. I agree with user stacker that by using well-designed libraries your C code can be pretty concise.

Depends greatly on the task and the size of the project. For many small interesting tasks, I would not be surprised by 100:1 smaller Python code simply because the standard libraries are extremely good. If you find, buy, or build C/C++ libraries that do what you want, I imagine the ratio would be much more like 3:1 on big projects.
However, finding, buying, and building C/C++ libraries does take time and effort, so I believe in the vast majority of cases, Python is going to be much faster to develop in.

Performance of Python worth the cost?

I'm looking at implementing a fuzzy logic controller based on either PyFuzzy (Python) or FFLL (C++) libraries.
I'd prefer to work with python but am unsure if the performance will be acceptable in the embedded environment it will work in (either ARM or embedded x86 proc both ~64Mbs of RAM).
The main concern is that response times are as fast as possible (an update rate of 5hz+ would be ideal >2Hz is required). The system would be reading from multiple (probably 5) sensors from an RS232 port and provide 2/3 outputs based on the results of the fuzzy evaluation.
Should I be concerned that Python will be too slow for this task?

In general, you shouldn't obsess over performance until you've actually seen it become a problem. Since we don't know the details of your app, we can't say how it'd perform if implemented in Python. And since you haven't implemented it yet, neither can you.
Implement the version you're most comfortable with, and can implement fastest, first. Then benchmark it. And if it is too slow, you have three options which should be done in order:
First, optimize your Python code
If that's not enough, write the most performance-critical functions in C/C++, and call that from your Python code
And finally, if you really need top performance, you might have to rewrite the whole thing in C++. But then at least you'll have a working prototype in Python, and you'll have a much clearer idea of how it should be implemented. You'll know what pitfalls to avoid, and you'll have an already correct implementation to test against and compare results to.

Python is very slow at handling large amounts of non-string data. For some operations, you may see that it is 1000 times slower than C/C++, so yes, you should investigate into this and do necessary benchmarks before you make time-critical algorithms in Python.
However, you can extend python with modules in C/C++ code, so that time-critical things are fast, while still being able to use python for the main code.

Make it work, then make it work fast.

If most of your runtime is spent in C libraries, the language you use to call these libraries isn't important. What language are your time-eating libraries written in ?

From your description, speed should not be much of a concern (and you can use C, cython, whatever you want to make it faster), but memory would be. For environments with 64 Mb max (where the OS and all should fit as well, right ?), I think there is a good chance that python may not be the right tool for target deployment.
If you have non trivial logic to handle, I would still prototype in python, though.

I never really measured the performance of pyfuzzy's examples, but as the new version 0.1.0 can read FCL files as FFLL does. Just describe your fuzzy system in this format, write some wrappers, and check the performance of both variants.
For reading FCL with pyfuzzy you need the antlr python runtime, but after reading you should be able to pickle the read object, so you don't need the antlr overhead on the target.

I'm a .NET Programmer. What are specific uses of Python and/or Ruby for that will make me more productive?

I recall when I first read Pragmatic Programmer that they suggested using scripting languages to make you a more productive programmer.
I am in a quandary putting this into practice.
I want to know specific ways that using Python or Ruby can make me a more productive .NET developer.
One specific way per answer, and even better if you can say whether I could use Python or Ruby or Both for it.
See standard format below.

IronPython / IronRuby
IronPython in Action will do a better job explaining this (and exactly how best to use IronPython) that can possibly be accommodated in a SO answer. I'm biased -- I was a tech reviewer and am a friend of one of the authors -- but objectively think it's a great book. (No idea if IronRuby is blessed with a similarly wonderful book, yet).
As you want "one specific way per answer" (incompatible with SO, which STRONGLY discourages a poster posting 25 different answers if they have 25 "specific ways" to indicate...!-): prototyping in order to explore some specific assembly or collection thereof that you're unfamiliar with (to check if you've understood their docs right and how to perform certain tasks) is an order of magnitude more productive in IronPython than in C#, as you can explore interactively and compilation is instantaneous and as-needed. (Have not tried IronRuby but I'll assume it can work in a roughly equivalent way and speed).

Less Code
I think productivity is direct result on how proficient you are in a specific language. That said the terseness of a language like Python might save some time on getting certain things done.
If I compare how much less code I have to write for simple administration scripts (e.g. clean-up of old files) compared to .NET code there is certain amount of productivity gain. (Plus it is more fun which also helps getting the job done)

Advanced Text Processing
Traditional strengths of awk and perl. You can just glue together a bunch of regular expressions to create a simple data-mining system on the go.

Learning a new language gives you knowledge that you can bring back to any programming language. Here are some things you'd learn.
Add functionality to your objects on the fly.
Mix in modules.
Pass a chunk of code around.
Figure out how to do more with less code: ruby -e "puts 'hello world'"
C# can do some of these things, but a fresh perspective might bring you one step closer to automating your breakfast.

Embedding a script engine
Use of IronPython for a scripting engine inside your .NET application. For example enabling end-users of your application to change customizable parts with a full fledge language such as Python.
A possible example might be to expose custom logic to end-users for a work flow engine.

Quick Prototyping - Both
In the simplest cases when firing a python interpreter and writing a line or two is way faster than creating a new project in visual studio.
And you can use ruby to. Or lua, or evel perl, whatever. The point is implicit typing and light-weight feel.

Cross platform
Compared to .NET a simple script Python is more easily ported to other platforms such as Linux. Although possible to achieve the same with the likes of Mono it simpler to run a Python script file on different platforms.

Processing received Email
Python has built-in support for POP3 and IMAP where the standard .NET framework doesn't. Useful for automating email triggered tasks.

Why not always use psyco for Python code?

psyco seems to be quite helpful in optimizing Python code, and it does it in a very non-intrusive way.
Therefore, one has to wonder. Assuming you're always on a x86 architecture (which is where most apps run these days), why not just always use psyco for all Python code? Does it make mistakes sometimes and ruins the correctness of the program? Increases the runtime for some weird cases?
Have you had any negative experiences with it? My most negative experience so far was that it made my code faster by only 15%. Usually it's better.
Naturally, using psyco is not a replacement for efficient algorithms and coding. But if you can improve the performance of your code for the cost of two lines (importing and calling psyco), I see no good reason not to.

1) The memory overhead is the main one, as described in other answers. You also pay the compilation cost, which can be prohibitive if you aren't selective. From the user reference:
Compiling everything is often overkill for medium- or large-sized applications. The drawbacks of compiling too much are in the time spent compiling, plus the amount of memory that this process consumes. It is a subtle balance to keep.
2) Performance can actually be harmed by Psyco compilation. Again from the user guide ("known bugs" section):
There are also performance bugs: situations in which Psyco slows down the code instead of accelerating it. It is difficult to make a complete list of the possible reasons, but here are a few common ones:
The built-in map and filter functions must be avoided and replaced by list comprehension. For example, map(lambda x: x*x, lst) should be replaced by the more readable but more recent syntax [x*x for x in lst].
The compilation of regular expressions doesn't seem to benefit from Psyco. (The execution of regular expressions is unaffected, since it is C code.) Don't enable Psyco on this module; if necessary, disable it explicitely, e.g. by calling psyco.cannotcompile(re.compile).
3) Finally, there are some relatively obscure situations where using Psyco will actually introduce bugs. Some of them are listed here.

Psyco currently uses a lot of memory.
It only runs on Intel 386-compatible
processors (under any OS) right now.
There are some subtle semantic
differences (i.e. bugs) with the way
Python works; they should not be
apparent in most programs.
See also the caveats section. For a hard example, I noticed that my web app with Cheetah-generated templates and DB I/O gained no appreciable speedup.

When using pyglet I found that I couldn't use psyco on the entire app without making my app non-functional. I could use it in small sections of math-heavy code, of course, but it wasn't necessary, so I didn't bother.
Also, psyco has done strange things with my profiling results (such as, well, not alter them at all from the non-psyco version). I suspect it doesn't play nice with the profiling code.
I just don't really use it unless I really want the speed, which is not all that often. My priority is algorithm optimization, which generally results in nicer speedups.

It also depends where your bottleneck is. I am mostly doing web apps and there the bottlenecks are probably more IO and database. So you should know where to optimize.
Also beware that maybe you first should think about your code instead of directly throwing psyco at it. So I agree with Devin, that algorithm optimizations should come first and they might have a smaller chance of unwanted sideeffects.

psyco is dead and not longer maintained. It is time to find another

One should never rely on some magic bullet to fix your problems. Using psyco to make a slow program faster is usually not necessary. Bad algorithms can be rewritten, and parts that require speed could be written in another language. Of course, your question asks why we don't use it for the speed boost anyways, and there's a bit of overhead when you use psyco. Psyco uses memory, and those two lines just sorta feel like overhead when you look at them. As for my personal reason on why I don't use psyco, it's because it doesn't support x86_64, which I see as the new up and coming architecture (especially with 2038 approaching sooner or later). My alternative is pypy, but I'm not entirely fond of that either.

A couple of other things:
It doesn't seem to be very actively maintained.
It can be a memory hog.

Quite simply: "Because the code already runs fast enough".

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.