Reverse Engineer Auto-Generated C?

How easy is it to reverse engineer auto-generated C code? I am working on a Python project, and as part of my work I am using Cython to compile the code for speedup purposes.
This does help in terms of speed, yet, I am concerned that where I work, some people would try to "peek" into the code and figure out what it does.
Cython code is basically auto-generated C. Is it very hard to reverse engineer it?
Are there any recommendations that would make the code safer and reverse engineering harder to do? (I assume that with enough effort, everything can be reverse engineered.)

Okay -- to attempt to answer your question more directly: most auto-generated C code is fairly ugly, so somebody would need to be fairly motivated to reverse engineer it. At the same time, I don't believe I've ever looked at what Cython generates, so I'm not sure how it looks.
In addition, a lot of auto-generated code is done in the form of things like state machine tables, which most programmers find fairly difficult to follow even at best. The tendency (in many cases) is to have a generic framework, with tables of data that the framework more or less "interprets" at run-time. This isn't necessarily impossible to follow, but it's different enough from most typical code that most people will give up on it fairly quickly (and if they do much, they'll typically waste a lot of time looking at the framework instead of the data, which is what really matters in cases like this).
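To make that concrete, here is a tiny sketch (in Python for brevity; the states and events are invented) of the table-driven style described above, where a generic loop "interprets" a transition table and the table, not the code, carries the logic:

TABLE = {                        # (state, event) -> next state
    ('idle', 'coin'): 'ready',
    ('ready', 'push'): 'idle',
}

def run(state, events):
    for event in events:
        # unknown events leave the state unchanged
        state = TABLE.get((state, event), state)
    return state

print(run('idle', ['coin', 'push']))  # -> 'idle'

Reading run() alone tells you almost nothing; the behaviour lives in TABLE, which is why people who study the framework instead of the data get nowhere.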
I'll repeat, however, that I'm pretty sure I haven't looked at what Cython produces, so I can't say much about it with any real certainty.
There are (or at least used to be) commercial obfuscators intended to make C source code difficult to understand. I suspect the availability of Perl has taken a lot of the market share from them, but if you look you may still be able to find one and use it.
Absent that, it's not terribly difficult to write an obfuscator of your own, but the degree of effectiveness will probably vary with the amount of effort you're willing to put into it. Just systematically renaming any meaningful variable names into things like _ and __ can do quite a bit (e.g., profit = sales - costs; is a lot more meaningful than _ = _I_ - _i_;). Depending on the machine generated code in question, however, this may not really accomplish much -- obfuscating a generic framework may not make much difference in understanding what your code does -- and if they figure out the procedure you're following, they may be able to simply replicate the correct framework code and transplant the pieces specific to your program into the un-obfuscated framework.
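For illustration, here is a naive sketch of such a renamer (in Python, since that's the asker's environment; a real obfuscator must respect scoping, keywords, strings and library names, so treat this strictly as a toy):

import re

def obfuscate(source, names):
    # map each meaningful identifier to an opaque run of underscores
    mapping = {name: '_' * (i + 1) for i, name in enumerate(names)}
    for old, new in mapping.items():
        source = re.sub(r'\b%s\b' % re.escape(old), new, source)
    return source

print(obfuscate('profit = sales - costs;', ['profit', 'sales', 'costs']))
# -> '_ = __ - ___;'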

You should really take a look at the code that Cython produces. To help with debugging, for example, it copies the complete Python source code into the generated file, marking each source line before generating C code for it. That makes it very easy to find the code section that you are interested in.
A very nice feature is that you can compile your code with the "-a" (annotate) option, and it will spit out an HTML file next to the C file that contains the annotated Python code. When you click on a line, you'll see the C code for that line. As a bonus, it marks lines that do a lot of Python processing in dark yellow, so that you get a simple indicator where to look for potential optimisations.
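If you prefer to drive this from Python rather than the command line, the same annotated output can be produced through the build API (a hedged sketch; mod.pyx is a hypothetical file name):

from Cython.Build import cythonize

# equivalent to "cython -a mod.pyx": writes mod.c plus an annotated mod.html
cythonize("mod.pyx", annotate=True)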
There's also special gdb support in Cython now, so you can do Cython source level debugging etc.

Ah, I guess I missed the bit where you were talking about the compiled module, whereas I was only referring to the source code that Cython generates. I agree with Jerry that it will be fairly tricky to extract something useful from the compiled module, as long as you keep the gdb support disabled (the default) and strip the debugging symbols. That is because the C compiler will do lots of inlining of helper functions all over the place and apply various low-level code optimisations, thus making it harder to extract the original macro level code patterns. However, you will see named C-API calls to CPython, and you will also see function names from your own code. Cython isn't specifically designed for code obfuscation, quite the opposite. But readable assembly has certainly never been a design goal.

Related

Are there advantages to use the Python/C interface instead of Cython?

I want to extend python and numpy by writing some modules in C or C++, using BLAS and LAPACK. I also want to be able to distribute the code as standalone C/C++ libraries. I would like these libraries to use both single and double precision floats. Some examples of functions I will write are conjugate gradient for solving linear systems or accelerated first order methods. Some functions will need to call a Python function from the C/C++ code.
After playing a little with the Python/C API and the Numpy/C API, I discovered that many people advocate the use of Cython instead (see for example this question or this one). I am not an expert on Cython, but it seems that for some cases, you still need to use the Numpy/C API and know how it works. Given the fact that I already have (some little) knowledge about the Python/C API and none about Cython, I was wondering if it makes sense to keep on using the Python/C API, and if using this API has some advantages over Cython. In the future, I will certainly develop some stuff not involving numerical computing, so this question is not only about numpy. One of the things I like about the Python/C API is the fact that I learn some stuff about how the Python interpreter is working.
Thanks.
The current "top answer" sounds a bit too much like FUD in my ears. For one, it is not immediately obvious that the Average Developer would write faster code in C than what NumPy+Cython gives you anyway. Quite the contrary, the time it takes to even get the necessary C code to work correctly in a Python environment is usually much better invested in writing a quick prototype in Cython, benchmarking it, optimising it, rewriting it in a faster way, benchmarking it again, and then deciding if there is anything in it that truly requires the 5-10% more performance that you may or may not get from rewriting 2% of the code in hand-tuned C and calling it from your Cython code.
I'm writing a library in Cython that currently has about 18K lines of Cython code, which translate to almost 200K lines of C code. I once managed to get a speed-up of almost 25% for a couple of very important internal base level functions, by injecting some 20 lines of hand-tuned C code in the right places. It took me a couple of hours to rewrite and optimise this tiny part. That's truly nothing compared to the huge amount of time I saved by not writing (and having to maintain) the library in plain C in the first place.
Even if you know C a lot better than Cython, if you know Python and C, you will learn Cython so quickly that it's worth the investment in any case, especially when you are into numerics. 80-95% of the code you write will benefit so much from being written in a high-level language, that you can safely lay back and invest half of the time you saved into making your code just as fast as if you had written it in a low-level language right away.
That being said, your comment that you want "to be able to distribute the code as standalone C/C++ libraries" is a valid reason to stick to plain C/C++. Cython always depends on CPython, which is quite a dependency. However, using plain C/C++ (except for the Python interface) will not allow you to take advantage of NumPy either, as that also depends on CPython. So, as usual when writing something in C, you will have to do a lot of groundwork before you get to the actual functionality. You should seriously think twice about this before you start the work.
First, there is one point in your question I don't get:
[...] also want to be able to distribute the code as standalone C/C++ libraries. [...] Some functions will need to call a Python function from the C/C++ code.
How is this supposed to work?
Next, as to your actual question, there are certainly advantages of using the Python/C API directly:
Most likely, you are more familiar with writing C code than writing Cython code.
Writing your code in C gives you maximum control. To get the same performance from Cython code as from equivalent C code, you'll have to be very careful. You'll not only need to make sure to declare the types of all variables, you'll also have to set some flags adequately -- just one example is bounds checking. You will need intimate knowledge of how Cython works to get the best performance (see the sketch after this list).
Cython code depends on Python. It does not seem to be a good idea to write code that should also be distributed as a standalone C library in Cython.
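As a hedged illustration of the typing and flag points above, here is a minimal Cython sketch (.pyx source, not plain Python; the dot function is made up):

cimport cython

@cython.boundscheck(False)   # skip index checks inside this function
@cython.wraparound(False)    # disallow negative indexing for extra speed
def dot(double[:] a, double[:] b):
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

Without the cdef declarations and the directives, the loop falls back to generic Python object operations and runs far slower.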
The main disadvantage of the Python/C API is that it can be very slow if it's used in an inner loop. I'm seeing that calling a Python function takes an 80-160x hit over calling an equivalent C++ function.
If that doesn't affect your code, then you benefit from being able to write some chunks of code in Python, access Python libraries, and support callbacks written directly in Python. That also means that you can make some changes without recompiling, making prototyping easier.

Python vs C : Line of Code Comparison vs Dev Time

Hi, I'm currently learning Python since the syntax feels so succinct and the idioms match well with my mental model.
However I'm also interested in learning about OS internals and reverse engineering software, which ultimately means knowing C in a rather thorough capacity.
When originally picking a language I did lots of reading and comparisons, and a number thrown out a lot is that short idiomatic statements in Python would require the equivalent of a few hundred lines of C (I'd guess code for memory management, code for dictionaries, lists, etc.) that we take for granted as built into the Python language.
1) With an average C programmer, is that 100-200 lines of code per Python idiom anywhere near accurate?
Because C doesn't come built in with Python-like constructs such as dictionaries/lists (with all their nice methods, etc.):
2) Do C programmers tend to build these constructs from scratch and then re-use them between projects to greatly reduce the actual amount of hand coding for their projects?
I assume re-using libraries like boost:: stuff also again, reduces some of the boilerplate hand coding also...
3) With popular libraries and re-use of common code one has written before in C for basic constructs, how much does that reduce the C line count compared to the equivalent Python, for an enthusiast-sized code base?
I know specific numbers aren't possible, but is it possible with libraries, code reuse etc, to have a development time in C close to that of Python without being a Linus Torvalds style coding machine?
Thanks!
but is it possible with libraries, code reuse etc, to have a development time in C close to that of Python
No.
You've missed the most important point.
Python's interactive. It's not edit-compile-link-execute-break-debug. It's edit-debug.
Boost is C++, not C (emphatically not C -- virtually all of it makes heavy use of templates and such that aren't part of C).
Yes, C programmers tend to build up personal libraries of code for all sorts of "stuff" -- data structures, algorithms, user interfaces, and so on. There are also a fair number of other libraries for everything from basic string manipulation to database connectivity, user interfaces, basic algorithms and data structures, etc.
Comparing productivity between the two can be difficult though -- even if something can be done in one line of code either way, there's a greater chance that the C programmer will end up doing extra work to find and learn to use that particular library. OTOH, if he has used it before, the two might be directly competitive, or (in a few cases) C might be more productive.
I'd guess Python ends up more productive more often, but trying to guess how much so is difficult (and lines of code usually won't be a good indication either).
When I did serious C programming, I read a book that claimed libraries are worth writing (especially in C, which is considered a low-level language).
Libraries are built for reuse.
If you use libraries and write one line like detectFace(faceDescriptor) or renderPDF(document), it doesn't matter whether an idiom in another language is more concise or not.
Lines of code isn't a proper metric if the question is what would be more efficient.
It depends.
Try to write an interrupt handler in python. Someone could probably make it work, but it's going to be a dancing bear: the dancing is not good, but it is surprising that a bear can do it at all. Want to write an OS or do some embedded programming? You're not going to be able to use python. It's telling that the main python implementation is written in C.
That being said I'm amazed at some of the low-level stuff that you can do with python. The high-level stuff is almost a given if you're measuring lines of code. Python is just a higher-level language.
They are both very useful tools, just for different types of projects. Knowing both would be very useful, particularly when you need to interface from Python to some new functionality that doesn't yet have a Python binding.
For the types of projects most developers work on, python is going to be more concise and quicker to write and debug. You may be able to make a library of reusable C code, but a good python programmer will be doing the same thing with their python code, at a higher level.
I think Python is more productive for small projects (up to a few thousand lines of code).
On the other hand, C is better suited for large projects (even though IMHO there are better languages for that, such as Ada): static type-checking allows you to find many errors at compile time that are much more difficult to detect at run-time, especially in a large program.
In a larger C project, the lack of lists and other powerful data structures that are found in Python can be compensated by implementing or using custom libraries. I agree with user stacker that by using well-designed libraries your C code can be pretty concise.
Depends greatly on the task and the size of the project. For many small interesting tasks, I would not be surprised by 100:1 smaller Python code simply because the standard libraries are extremely good. If you find, buy, or build C/C++ libraries that do what you want, I imagine the ratio would be much more like 3:1 on big projects.
However, finding, buying, and building C/C++ libraries does take time and effort, so I believe in the vast majority of cases, Python is going to be much faster to develop in.

Good practices in writing MATLAB code? [closed]

I would like to know the basic principles and etiquette of writing a well structured code.
Read Code Complete, it will do wonders for everything. It'll show you where, how, and when things matter. It's pretty much the Bible of software development (IMHO.)
These are the most important two things to keep in mind when you are writing code:
Don't write code that you've already written.
Don't write code that you don't need to write.
MATLAB Programming Style Guidelines by Richard Johnson is a good resource.
Well, if you want it in layman's terms:
I recommend people write the shortest readable program that works.
There are a lot more rules about how to format code, name variables, design classes, and separate responsibilities. But you should not forget that all of those rules are only there to make sure that your code is easy to check for errors, and to ensure it is maintainable by someone other than the original author. If you keep the above recommendation in mind, your program will be just that.
This list could go on for a long time but some major things are:
Indent.
Descriptive variable names.
Descriptive class / function names.
Don't duplicate code. If it needs duplication put in a class / function.
Use getters / setters.
Only expose what's necessary in your objects.
Single dependency principle.
Learn how to write good comments, not lots of comments.
Take pride in your code!
Two good places to start:
Clean-Code Handbook
Code-Complete
If you want something to use as a reference or etiquette, I often follow the official Google style conventions for whatever language I'm working in, such as for C++ or for Python.
The Practice of Programming by Rob Pike and Brian W. Kernighan also has a section on style that I found helpful.
First of all, "codes" is not the right word to use. A code is a representation of another thing, usually numeric. The correct words are "source code", and the plural of source code is source code.
--
Writing good source code:
Comment your code.
Use variable names longer than several letters. Between 5 and 20 is a good rule of thumb.
Shorter lines of code is not better - use whitespace.
Being "clever" with your code is a good way to confuse yourself or another person later on.
Decompose the problem into its components and use hierarchical design to assemble the solution.
Remember that you will need to change your program later on.
Comment your code.
There are many fads in computer programming. Their proponents consider those who are not following the fad unenlightened and not very with-it. The current major fads seem to be "Test Driven Development" and "Agile". The fad in the 1990s was 'Object Oriented Programming'. Learn the useful core parts of the ideas that come around, but don't be dogmatic and remember that the best program is one that is getting the job done that it needs to do.
A very trivial example of over-condensed code, off the top of my head:
for(int i=0, j=i; i<10 && j!=100; i++){
    if(i==j) return i*j;
    else j*=2;
}
while this is more readable:
int j = 0;
for(int i = 0; i < 10; i++)
{
if (i == j)
{
return i * j;
}
else
{
j *= 2;
if(j == 100)
{
break;
}
}
}
The second example has the logic for exiting the loop clearly visible; the first example has the logic entangled with the control flow. Note that these two programs do exactly the same thing. My programming style takes up a lot of lines of code, but I have never once encountered a complaint about it being hard to understand stylistically, while I find the more condensed approaches frustrating.
An experienced programmer can and will read both - the above may make them pause for a moment and consider what is happening. Forcing the reader to sit down and stare at the code is not a good idea. Code needs to be obvious. Each problem has an intrinsic complexity to expressing its solution. Code should not be more complex than the solution complexity, if at all possible.
That is the essence of what the other poster tried to convey - don't make the program longer than need be. Longer has two meanings: more lines of code (ie, putting braces on their own line), and more complex. Making a program more complex than need be is not good. Making it more readable is good.
Have a look at 97 Things Every Programmer Should Know.
It's free and contains a lot of gems like this one:
There is one quote that I think is particularly good for all software developers to know and keep close to their hearts:

Beauty of style and harmony and grace and good rhythm depends on simplicity. — Plato

In one sentence I think this sums up the values that we as software developers should aspire to. There are a number of things we strive for in our code:
Readability
Maintainability
Speed of development
The elusive quality of beauty
Plato is telling us that the enabling factor for all of these qualities is simplicity.
The Python Style Guide is always a good starting point!
European Standards For Writing and Documenting Exchangeable Fortran 90 Code have been in my bookmarks, like forever. Also, there was a thread in here, since you are interested in MATLAB, on organising MATLAB code.
Personally, I've found that I learned more about programming style from working through SICP, the MIT intro to computer science text (I'm about a quarter of the way through), than from any other book. That being said, if you're going to be working in Python, the Google style guide is an excellent place to start.
I read somewhere that most programs (scripts anyways) should never be more than a couple of lines long. All the requisite functionality should be abstracted into functions or classes. I tend to agree.
Many good points have been made above. I definitely second all of the above. I would also like to add that spelling and consistency in coding should be something you practice (and also in real life).
I've worked with some offshore teams and though their English is pretty good, their spelling errors caused a lot of confusion. So for instance, if you need to look for some function (e.g., getFeedsFromDatabase) and they spell database wrong or something else, that can be a big or small headache, depending on how many dependencies you have on that particular function. The fact that it gets repeated over and over within the code will first off, drive you nuts, and second, make it difficult to parse.
Also, keep up with consistency in terms of naming variables and functions. There are many protocols to go by but as long as you're consistent in what you do, others you work with will be able to better read your code and be thankful for it.
Pretty much everything has been said here, and then some. In my opinion, the best site concerning what you're looking for is below (the Zen of Python parts especially are fun and true):
http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
Talks about both PEP-20 and PEP-8, some easter eggs (fun stuff), etc...
You can have a look at the Stanford online course: Programming Methodology CS106A. The instructor gives several really good instructions for writing source code.
Some of them are as follows:
Write programs for people to read, not just for computers to read. Both of them need to be able to read it, but it's far more important that a person reads it and understands it, and that the computer still executes it correctly. That's the first major software engineering principle to think about.
How to make comments:
Put in comments to clarify things in the program that are not obvious.
How to do decomposition:
One method solves one problem
Each method has approximately 1-15 lines of code
Give methods good names
Write comments for code
Unit Tests
Python and MATLAB are dynamic languages. As your code base grows, you will be forced to refactor it. In contrast to statically typed languages, the compiler will not detect 'broken' parts in your project. Using unit test frameworks like xUnit not only compensates for missing compiler checks, it allows refactoring with continuous verification for all parts of your project.
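A minimal sketch with Python's built-in unittest module (the slugify function under test is hypothetical):

import unittest

def slugify(title):
    # turn a headline into a URL fragment
    return title.strip().lower().replace(' ', '-')

class SlugifyTest(unittest.TestCase):
    def test_spaces_become_dashes(self):
        self.assertEqual(slugify('Hello World'), 'hello-world')

    def test_surrounding_whitespace_is_dropped(self):
        self.assertEqual(slugify('  Hi  '), 'hi')

if __name__ == '__main__':
    unittest.main()

Run it after every refactoring; a red bar here is the dynamic-language counterpart of a compile error.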
Source Control
Track your source code with a version control system like svn, git or any other derivative. You'll be able to go back and forth in your code history, make branches, or create tags for deployed/released versions.
Bug Tracking
Use a bug tracking system, if possible connected with your source control system, in order to stay on top of your issues. You may not be able (or required) to fix issues right away.
Reduce Entropy
While integrating new features into your existing code base, you will add more lines of code, and potentially more complexity. This will increase entropy. Try to keep your design clean by introducing an interface or inheritance hierarchy in order to reduce entropy again. Not paying attention to code entropy will render your code unmaintainable over time.
All of The Above Mentioned
Pure coding-related topics, like using a style guide, not duplicating code, and so on, have already been mentioned.
A small addition to the wonderful answers already here regarding Matlab:
Avoid long scripts, instead write functions (sub routines) in separate files. This will make the code more readable and easier to optimize.
Use MATLAB's built-in function capabilities. That is, learn about the many, many functions that MATLAB offers instead of reinventing the wheel.
Use code sectioning, and whatever other code-structuring features the newest MATLAB version offers.
Learn how to benchmark your code using timeit and profile. You'll discover that sometimes for loops are the better solution.
The best advice I got when I asked this question was as follows:
Never code while drunk.
Make it readable, make it intuitive, make it understandable, and make it commented.

what next after pyparsing?

I have a huge grammar developed for pyparsing as part of a large, pure Python application.
I have reached the limit of performance tweaking and I'm at the point where the diminishing returns make me start to look elsewhere. Yes, I think I know most of the tips and tricks and I've profiled my grammar and my application to dust.
What next?
I hope to find a parser that gives me the same readability, usability (I'm using many advanced features of pyparsing, such as parse actions to start the post-processing of the input being parsed) and Python integration, but at 10× the performance.
I love the fact that the grammar is pure Python.
All my basic blocks are regular expressions, so reusing them would be nice.
I know I can't have everything so I am willing to give up on some of the features I have today to get to the requested 10× performance.
Where do I go from here?
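To show the style I mean, here is a minimal sketch of how I use parse actions (the grammar is heavily simplified and the names are made up):

from pyparsing import Word, nums, delimitedList

# convert each matched integer to int during the parse itself
integer = Word(nums).setParseAction(lambda t: int(t[0]))
int_list = delimitedList(integer)

print(int_list.parseString('1, 2, 3').asList())  # -> [1, 2, 3]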
It looks like the pyparsing folks have anticipated your problem. From https://github.com/pyparsing/pyparsing/blob/master/docs/HowToUsePyparsing.rst :
Performance of pyparsing may be slow for complex grammars and/or large input strings. The psyco package can be used to improve the speed of the pyparsing module with no changes to grammar or program logic - observed improvements have been in the 20-50% range.
However, as Vangel noted in the comments below, psyco is an obsolete project as of March 2012. Its successor is the PyPy project, which starts from the same basic approach to performance: use a JIT native-code compiler instead of a bytecode interpreter. You should be able to achieve similar or greater gains with PyPy if switching Python implementations will work for you.
If you're really a speed demon, but want to keep some of the legibility and declarative syntax, I'd suggest having a look at ANTLR. Probably not the Python-generating backend; I'm skeptical whether that's mature or high-performance enough for your needs. I'm talking about the goods: the C backend that started it all.
Wrap a Python C extension module around the entry point to the parser, and turn it loose.
Having said that, you'll be giving up a lot in this transition: basically any Python you want to do in your parser will have to be done through the C API (not altogether pretty). Also, you'll have to get used to very different ways of doing things. ANTLR has its charms, but it's not based on combinators, so there's not the easy and fluid relationship between your grammar and your language that there is in pyparsing. Plus, it's its own DSL, much like lex/yacc, which can present a learning curve – but, because it's LL based, you'll probably find it easier to adapt to your needs.
Switch to a generated C/C++ parser (using ANTLR, flex/bison, etc.). If you can delay all the action rules until after you are done parsing, you might be able to build an AST with trivial code and then pass that back to your python code via something like SWIG and process on it with your current actions rules. OTOH, for that to give you a speed boost, the parsing has to be the heavy lifting. If your action rules are the big cost, then this will buy you nothing unless you write your action rules in C as well (but you might have to do it anyway to avoid paying for whatever impedance mismatch you get between the python and C code).
If you really want performance for large grammars, look no farther than SimpleParse (which itself relies on mxTextTools, a C extension). However, know now that it comes at the cost of being more cryptic and requiring that you be well-versed in EBNF.
It's definitely not the more Pythonic route, and you're going to have to start all over with an EBNF grammar to use SimpleParse.
A bit late to the party, but PLY (Python Lex-Yacc) has served me very well. PLY gives you a pure Python framework for constructing lex-based tokenizers and yacc-based LR parsers.
I went this route when I hit performance issues with pyparsing.
Here is a somewhat old but still interesting article on Python parsing which includes benchmarks for ANTLR, PLY and pyparsing. PLY is roughly 4 times faster than pyparsing in this test.
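For a flavour of what the move looks like, here is a hedged, minimal PLY sketch (a toy grammar that sums integers; porting a huge pyparsing grammar is of course a much bigger job):

import ply.lex as lex
import ply.yacc as yacc

tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)   # drop unrecognised characters

def p_expr_plus(p):
    'expr : expr PLUS NUMBER'
    p[0] = p[1] + p[3]

def p_expr_number(p):
    'expr : NUMBER'
    p[0] = p[1]

def p_error(p):
    pass

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('1 + 2 + 3'))  # -> 6

Note how the token and rule definitions double as docstrings; it takes some getting used to after pyparsing's combinator style.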
There's no way to know what kind of benefit you'll get without just testing it, but it's within the range of possibility that you could get 10x benefit just from using Unladen Swallow if your process is long-running and repetitive. (Also, if you have many things to parse and you typically start a new interpreter for each one, Unladen Swallow gets faster - to a point - the longer you run your process, so while parsing one input might not show much gain, you might get significant gains on the 2nd and 3rd inputs in the same process).
(Note: pull the latest out of SVN - you'll get far better performance than the latest tarball)

How can I use Python for large scale development?

I would be interested to learn about large scale development in Python and especially in how do you maintain a large code base?
When you make incompatible changes to the signature of a method, how do you find all the places where that method is being called? In C++/Java the compiler will find them for you; how do you do it in Python?
When you make changes deep inside the code, how do you find out what operations an instance provides, since you don't have a static type to lookup?
How do you handle/prevent typing errors (typos)?
Are unit tests used as a substitute for static type checking?
As you can guess I almost only worked with statically typed languages (C++/Java), but I would like to try my hands on Python for larger programs. But I had a very bad experience, a long time ago, with the clipper (dBase) language, which was also dynamically typed.
Don't use a screw driver as a hammer
Python is not a statically typed language, so don't try to use it that way.
When you use a specific tool, you use it for what it has been built. For Python, it means:
Duck typing: no type checking, only behavior matters. Therefore your code must be designed to use this feature. A good design means generic signatures, no dependencies between components, and high abstraction levels, so if you change anything, you won't have to change the rest of the code. Python will not complain either; that's what it has been built for. Types are not an issue (see the sketch after this list).
Huge standard library. You do not need to change all your calls in the program if you use standard features you haven't coded yourself. And Python comes with batteries included. I keep discovering them every day. I had no idea of the number of modules I could use when I started, and tried to rewrite existing stuff like everybody else does. It's OK, you can't get it all right from the beginning.
You don't write Java, C++, Python, PHP, Erlang, whatever, the same way. There are good reasons why there is room for each of so many different languages: they do not do the same things.
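Here is the kind of genericity the duck-typing point above is getting at (a minimal sketch; the function and data are invented):

import io

def count_words(stream):
    # any object with a .read() method will do; no interface declaration needed
    return len(stream.read().split())

print(count_words(io.StringIO('a quick example')))  # in-memory "file" -> 3
# count_words(open('notes.txt')) would work identically (hypothetical file)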
Unit tests are not a substitute
Unit tests must be performed with any language. The most famous unit test library (JUnit) is from the Java world!
This has nothing to do with types. You check behaviors, again. You avoid trouble with regression. You assure your customer that you are on track.
Python for large scale projects
Languages, libraries and frameworks don't scale. Architectures do.
If you design a solid architecture, if you are able to make it evolve quickly, then it will scale. Unit tests help, automatic code checks as well. But they are just safety nets. And small ones.
Python is especially suitable for large projects because it enforces some good practices and has a lot of usual design patterns built-in. But again, do not use it for what it is not designed. E.g : Python is not a technology for CPU intensive tasks.
In a huge project, you will most likely use several different technologies anyway. Such as a DBMS (SGBD in French) and a templating language, for instance. Python is no exception.
You will probably want to use C/C++ for the part of your code you need to be fast. Or Java to fit in a Tomcat environment. Don't know, don't care. Python can play well with these.
As a conclusion
My answer may feel a bit rude, but don't get me wrong: this is a very good question.
A lot of people come to Python with old habits. I screwed myself trying to code Java like Python. You can, but will never get the best of it.
If you have played / want to play with Python, it's great! It's a wonderful tool. But just a tool, really.
I had some experience with modifying "Frets On Fire", an open source python "Guitar Hero" clone.
As I see it, Python is not really suitable for a really large-scale project.
I found myself spending a large part of the development time debugging issues related to assignment of incompatible types, things that statically typed languages will reveal effortlessly at compile time.
Also, since types are determined at run time, trying to understand existing code becomes harder, because you have no idea what the type is of the parameter you are currently looking at.
In addition to that, calling functions by their name string with the getattr built-in function is generally more common in Python than in other programming languages, thus making the call graph to a certain function somewhat hard to obtain (although you can call functions by name in some statically typed languages as well).
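For instance, a dispatch pattern like this (a minimal sketch, names invented) is common and makes the call graph invisible to a static scan of the source:

class Handlers:
    def on_start(self):
        return 'started'

def dispatch(obj, event):
    # the method name only exists as a string until run time
    return getattr(obj, 'on_' + event)()

print(dispatch(Handlers(), 'start'))  # -> 'started'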
I think that Python really shines in small scale software, rapid prototype development, and gluing existing programs together, but I would not use it for large scale software projects, since in those types of programs maintainability becomes the real issue, and in my opinion python is relatively weak there.
Since nobody pointed out pychecker, pylint and similar tools, I will: pychecker and pylint are tools that can help you find incorrect assumptions (about function signatures, object attributes, etc.) They won't find everything that a compiler might find in a statically typed language -- but they can find problems that such compilers for such languages can't find, too.
Python (and any dynamically typed language) is fundamentally different in terms of the errors you're likely to cause and how you would detect and fix them. It has definite downsides as well as upsides, but many (including me) would argue that in Python's case, the ease of writing code (and the ease of making it structurally sound) and of modifying code without breaking API compatibility (adding new optional arguments, providing different objects that have the same set of methods and attributes) make it suitable just fine for large codebases.
My 0.10 EUR:
I have several Python applications in 'production' state. Our company uses Java, C++ and Python. We develop with the Eclipse IDE (PyDev for Python).
Unit tests are the key solution to the problem (also for C++ and Java).
The less safe world of "dynamic typing" will make you more careful about your code quality.
BY THE WAY:
Large-scale development doesn't mean that you use one single language!
Large-scale development often uses a handful of languages specific to the problem.
So I agree with the hammer problem :-)
PS: static typing & Python
Here are some items that have helped me maintain a fairly large system in python.
Structure your code in layers, i.e. separate business logic, presentation logic and your persistence layer. Invest a bit of time in defining these layers and make sure everyone on the project is bought in. For large systems, creating a framework that forces you into a certain way of development can be key as well.
Tests are key; without unit tests you will likely end up with an unmanageable code base several times quicker than with other languages. Keep in mind that unit tests are often not sufficient; make sure to have several integration/acceptance tests you can run quickly after any major change.
Use the Fail Fast principle. Add assertions for cases where you feel your code may be vulnerable (see the sketch after this list).
Have standard logging/error handling that will help you quickly navigate to the issue.
Use an IDE (PyDev works for me) that provides type-ahead and pyLint/pychecker integration, to help you detect common typos right away and promote some coding standards.
Be careful about your imports: never do from x import *, and don't do relative imports without using the dot syntax.
Do refactor; a search/replace tool with regular expressions is often all you need for move-method/move-class style refactorings.
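As promised above, a small fail-fast sketch (the function and its rules are hypothetical):

def apply_discount(price, rate):
    assert price >= 0, 'price must be non-negative, got %r' % (price,)
    assert 0.0 <= rate <= 1.0, 'rate must be in [0, 1], got %r' % (rate,)
    return price * (1.0 - rate)

print(apply_discount(100.0, 0.2))  # -> 80.0

Bad input now fails at the boundary with a clear message, instead of surfacing three layers deeper as a confusing type error.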
Incompatible changes to the signature of a method. This doesn't happen as much in Python as it does in Java and C++.
Python has optional arguments, default values, and far more flexibility in defining method signatures. Also, duck typing means that -- for example -- you don't have to switch from some class to an interface as part of a significant software change. Things just aren't as complex.
How do you find all the places where that method is being called? grep works for dynamic languages. If you need to know every place a method is used, grep (or equivalent IDE-supported search) works great.
How do you find out what operations an instance provides, since you don't have a static type to lookup?
a. Look at the source. You don't have the Java/C++ problem of object libraries and jar files to contend with. You don't need all the elaborate aids and tools that those languages require.
b. An IDE can provide signature information under many common circumstances. You can easily defeat your IDE's reasoning powers. When that happens, you should probably review what you're doing to be sure it makes sense. If your IDE can't reason out your type information, perhaps it's too dynamic.
c. In Python, you often work through the interactive interpreter. Unlike Java and C++, you can explore your instances directly and interactively. You don't need a sophisticated IDE.
Example:
>>> x = SomeClass()
>>> dir(x)
How do you handle/prevent typing errors? Same as static languages: you don't prevent them. You find and correct them. Java can only find a certain class of typos. If you have two similar class or variable names, you can wind up in deep trouble, even with static type checking.
Example:
class MyClass { }
class MyClassx extends MyClass { }
A typo with these two class names can cause havoc. ["But I wouldn't put myself in that position with Java," folks say. Agreed. I wouldn't put myself in that position with Python, either; you make classes that are profoundly different, and will fail early if they're misused.]
Are unit tests used as a substitute for static type checking? Here's the other point of view: static type checking is a substitute for clear, simple design.
I've worked with programmers who weren't sure why an application worked. They couldn't figure out why things didn't compile; they didn't know the difference between an abstract superclass and an interface, and they couldn't figure out why a change in one place makes a bunch of other modules in a separate JAR file crash. The static type checking gave them false confidence in a flawed design.
Dynamic languages allow programs to be simple. Simplicity is a substitute for static type checking. Clarity is a substitute for static type checking.
My general rule of thumb is to use dynamic languages for small non-mission-critical projects and statically-typed languages for big projects. I find that code written in a dynamic language such as python gets "tangled" more quickly. Partly that is because it is much quicker to write code in a dynamic language and that leads to shortcuts and worse design, at least in my case. Partly it's because I have IntelliJ for quick and easy refactoring when I use Java, which I don't have for python.
The usual answer to that is testing testing testing. You're supposed to have an extensive unit test suite and run it often, particularly before a new version goes online.
Proponents of dynamically typed languages make the case that you have to test anyway because even in a statically typed language conformance to the crude rules of the type system covers only a small part of what can potentially go wrong.
