Program structure in long running data processing python script

Program structure in long running data processing python script - python

For my current job I am writing some long-running (think hours to days) scripts that do CPU intensive data-processing. The program flow is very simple - it proceeds into the main loop, completes the main loop, saves output and terminates: The basic structure of my programs tends to be like so:
<import statements>
<constant declarations>
<misc function declarations>
def main():
for blah in blahs():
<lots of local variables>
<lots of tightly coupled computation>
for something in somethings():
<lots more local variables>
<lots more computation>
<etc., etc.>
<save results>
if __name__ == "__main__":
main()
This gets unmanageable quickly, so I want to refactor it into something more manageable. I want to make this more maintainable, without sacrificing execution speed.
Each chuck of code relies on a large number of variables however, so refactoring parts of the computation out to functions would make parameters list grow out of hand very quickly. Should I put this sort of code into a python class, and change the local variables into class variables? It doesn't make a great deal of sense tp me conceptually to turn the program into a class, as the class would never be reused, and only one instance would ever be created per instance.
What is the best practice structure for this kind of program? I am using python but the question is relatively language-agnostic, assuming a modern object-oriented language features.

First off, if your program is going to be running for hours/days then the overhead of switching to using classes/methods instead of putting everything in a giant main is pretty much non-existent.
Additionally, refactoring (even if it does involve passing a lot of variables) should help you improve speed in the long run. Profiling an application which is designed well is much easier because you can pin-point the slow parts and optimize there. Maybe a new library comes along that's highly optimized for your calculations... a well designed program will let you plug it in and test right away. Or perhaps you decide to write a C Module extension to improve the speed of a subset of your calculations, a well designed application will make that easy too.
It's hard to give concrete advice without seeing <lots of tightly coupled computation> and <lots more computation>. But, I would start with making every for block it's own method and go from there.

Not too clean, but works well in little projects...
You can start using modules as if they were singleton instances, and create real classes only when you feel the complexity of the module or the computation justifies them.
If you do that, you would want to use "import module" and not "from module import stuff" -- it's cleaner and will work better if "stuff" can be reassigned. Besides, it's recommended in the Google guidelines.

Using a class (or classes) can help you organize your code.
Simplicity of form (such as through use of class attributes and methods) is important because it helps you see your algorithm, and can help you more easily unit test the parts.
IMO, these benefits far outweigh the slight loss of speed that may come with using OOP.

Related

What is the point of Pool.apply() in Python if it blocks

I am trying to understand what the point of Pool.apply() is, within a parallelisation module.
My understanding is it is synchronous, and so no parallel processing occurs - but you just get the overhead of running code in separate processes.

I agree that it's not useful for parallelization, but it might have other uses. For example, if by poor life choices, your function has an unholy mess of side effects like changing global variables (probably because you only intended to execute it in a Pool when writing it), running it in a separate process can help with that.
Of course, using it like that is most likely an anti-pattern, so I guess this function only exists for compatibility reasons...

List out unimplemented python functions without running the application

Is there a way to produce a sensible, readable (or any other kind) list of unimplemented methods/classes using some form of static analysis? Can I skip the painful waiting for processing/network operations to finish for 15 minutes, just to see unimplemented errors one at a time?
Before starting with more serious design decisions, I like to lay out the skeleton of the core of the application. That usually means writing a bunch of imaginary functions that would be implemented eventually. This allows me to think in a conformist way, providing me with an API that's oriented towards easier usage. This is in contrast with the head-first approach of writing the API first with ease of implementation in mind.
While it appears sensible to write the API first, it actually tends to be riddled with bad design decisions. Implementing the core of the application with the idealistic imaginary API tends to provide a clearer idea of how should the API be implemented. Also the tendency is for the API to be less entangled itself.
There is one issue with this however. I usually go all shakespeare on it, so I oft forget what was I wanting to use in the first place. Solution? Run it, see where it cries out about non implemented stuff, implement it, rinse and repeat.
Now, once there's over 50% of the API implemented, rest of it is usually called only after some other heavy processing (which can take up quite a bit of the execution time), leaving me waiting for the not implemented errors. Oh and since python is interpreted, that's where it stops. So I can't really just take the first error log and implement stuff using it as a listing.

Python Profiling - Rollup function calls that are outside my code

I am trying to profile our django unittests (if the tests are faster, we'll run 'em more often). I've ran it through python's built in cProfile profiler, producing a pstats file.
However the signal to noise ration is bad. There are too many functions listed. Lots and lots of django internal functions are called when I make one database query. This makes it hard to see what's going on.
Is there anyway I can "roll up" all function calls that are outside a certain directory?
e.g. if I call a python function outside my directory, and it then calls 5 other functions (all outside my directory), then it should roll all those up, so it looks like there was only one function call, and it should show the cumulative time for the whole thing.
This, obviously, is bad if you want to profile (say) Django, but I don't want to do that.
I looked at the pstats.Stats object, but can't see an obvious way to modify this data.

I have little experience with python, but a lot in performance tuning, so here's a possibility:
Rather than run profiling as part of the unit tests, just do overall execution time measurements. If those don't change much, you're OK.
When you do detect a change, or if you just want to make things run faster, use this method. It has a much higher "signal-to-noise ratio", as you put it.
The difference is you're not hoping the profiler can figure out what you need to be looking at. It's more like a break point in a debugger, where the break occurs not at a place of your choosing, but with good probability at a time of unnecessary slowness.
If on two or more occasions you see it doing something that might be replaced with something better, on average it will pay off, and fixing it will make other problems easier to find by the same method.

Debugging a scripting language like ruby

I am basically from the world of C language programming, now delving into the world of scripting languages like Ruby and Python.
I am wondering how to do debugging.
At present the steps I follow is,
I complete a large script,
Comment everything but the portion I
want to check
Execute the script
Though it works, I am not able to debug like how I would do in, say, a VC++ environment or something like that.
My question is, is there any better way of debugging?
Note: I guess it may be a repeated question, if so, please point me to the answer.

Your sequence seems entirely backwards to me. Here's how I do it:
I write a test for the functionality I want.
I start writing the script, executing bits and verifying test results.
I review what I'd done to document and publish.
Specifically, I execute before I complete. It's way too late by then.
There are debuggers, of course, but with good tests and good design, I've almost never needed one.

Here's a screencast on ruby debugging with ruby-debug.

Seems like the problem here is that your environment (Visual Studio) doesn't support these languages, not that these languages don't support debuggers in general.
Perl, Python, and Ruby all have fully-featured debuggers; you can find other IDEs that help you, too. For Ruby, there's RubyMine; for Perl, there's Komodo. And that's just off the top of my head.

There is a nice gentle introduction to the Python debugger here

If you're working with Python then you can find a list of debugging tools here to which I just want to add Eclipse with the Pydev extension, which makes working with breakpoints etc. also very simple.

My question is, is there any better way of debugging?"
Yes.
Your approach, "1. I complete a large script, 2. Comment everything but the portion I want to check, 3. Execute the script" is not really the best way to write any software in any language (sorry, but that's the truth.)
Do not write a large anything. Ever.
Do this.
Decompose your problem into classes of objects.
For each class, write the class by
2a. Outline the class, focus on the external interface, not the implementation details.
2b. Write tests to prove that interface works.
2c. Run the tests. They'll fail, since you only outlined the class.
2d. Fix the class until it passes the test.
2e. At some points, you'll realize your class designs aren't optimal. Refactor your design, assuring your tests still pass.
Now, write your final script. It should be short. All the classes have already been tested.
3a. Outline the script. Indeed, you can usually write the script.
3b. Write some test cases that prove the script works.
3c. Runt the tests. They may pass. You're done.
3d. If the tests don't pass, fix things until they do.
Write many small things. It works out much better in the long run that writing a large thing and commenting parts of it out.

Script languages have no differences compared with other languages in the sense that you still have to break your problems into manageable pieces -- that is, functions. So, instead of testing the whole script after finishing the whole script, I prefer to test those small functions before integrating them. TDD always helps.

There's a SO question on Ruby IDEs here - and searching for "ruby IDE" offers more.
I complete a large script
That's what caught my eye: "complete", to me, means "done", "finished", "released". Whether or not you write tests before writing the functions that pass them, or whether or not you write tests at all (and I recommend that you do) you should not be writing code that can't be run (which is a test in itself) until it's become large. Ruby and Python offer a multitude of ways to write small, individually-testable (or executable) pieces of code, so that you don't have to wait for (?) days before you can run the thing.
I'm building a (Ruby) database translation/transformation script at the moment - it's up to about 1000 lines and still not done. I seldom go more than 5 minutes without running it, or at least running the part on which I'm working. When it breaks (I'm not perfect, it breaks a lot ;-p) I know where the problem must be - in the code I wrote in the last 5 minutes. Progress is pretty fast.
I'm not asserting that IDEs/debuggers have no place: some problems don't surface until a large body of code is released: it can be really useful on occasion to drop the whole thing into a debugging environment to find out what is going on. When third-party libraries and frameworks are involved it can be extremely useful to debug into their code to locate problems (which are usually - but not always - related to faulty understanding of the library function).

You can debug your Python scripts using the included pdb module. If you want a visual debugger, you can download winpdb - don't be put off by that "win" prefix, winpdb is cross-platform.

The debugging method you described is perfect for a static language like C++, but given that the language is so different, the coding methods are similarly different. One of the big very important things in a dynamic language such as Python or Ruby is the interactive toplevel (what you get by typing, say python on the command line). This means that running a part of your program is very easy.
Even if you've written a large program before testing (which is a bad idea), it is hopefully separated into many functions. So, open up your interactive toplevel, do an import thing (for whatever thing happens to be) and then you can easily start testing your functions one by one, just calling them on the toplevel.
Of course, for a more mature project, you probably want to write out an actual test suite, and most languages have a method to do that (in Python, this is doctest and nose, don't know about other languages). At first, though, when you're writing something not particularly formal, just remember a few simple rules of debugging dynamic languages:
Start small. Don't write large programs and test them. Test each function as you write it, at least cursorily.
Use the toplevel. Running small pieces of code in a language like Python is extremely lightweight: fire up the toplevel and run it. Compare with writing a complete program and the compile-running it in, say, C++. Use that fact that you can quickly change the correctness of any function.
Debuggers are handy. But often, so are print statements. If you're only running a single function, debugging with print statements isn't that inconvenient, and also frees you from dragging along an IDE.

There's a lot of good advice here, i recommend going through some best practices:
http://github.com/edgecase/ruby_koans
http://blog.rubybestpractices.com/
http://on-ruby.blogspot.com/2009/01/ruby-best-practices-mini-interview-2.html
(and read Greg Brown's book, it's superb)
You talk about large scripts. A lot of my workflow is working out logic in irb or the python shell, then capturing them into a cascade of small, single-task focused methods, with appropriate tests (not 100% coverage, more focus on edge and corner cases).
http://binstock.blogspot.com/2008/04/perfecting-oos-small-classes-and-short.html

Encapsulation severely hurts performance?

I know this question is kind of stupid, maybe it's a just a part of writing code but it seems defining simple functions can really hurt performance severely... I've tried this simple test:
def make_legal_foo_string(x):
return "This is a foo string: " + str(x)
def sum_up_to(x):
return x*(x+1)/2
def foo(x):
return [make_legal_foo_string(x),sum_up_to(x),x+1]
def bar(x):
return ''.join([str(foo(x))," -- bar !! "])
it's very good style and makes code clear but it can be three times as slow as just writing it literally. It's inescapable for functions that can have side effects but it's actually almost trivial to define some functions that just should literally be replaced with lines of code every time they appear, translate the source code into that and only then compile. Same I think for magic numbers, it doesn't take a lot of time to read from memory but if they're not supposed to be changed then why not just replace every instance of 'magic' with a literal before the code compiles?

Function call overheads are not big; you won't normally notice them. You only see them in this case because your actual code (x*x) is itself so completely trivial. In any real program that does real work, the amount of time spent in function-calling overhead will be negligably small.
(Not that I'd really recommend using foo, identity and square in the example, in any case; they're so trivial it's just as readable to have them inline and they don't really encapsulate or abstract anything.)
if they're not supposed to be changed then why not just replace every instance of 'magic' with a literal before the code compiles?
Because programs are written for to be easy for you to read and maintain. You could replace constants with their literal values, but it'd make the program harder to work with, for a benefit that is so tiny you'll probably never even be able to measure it: the height of premature optimisation.

Encapsulation is about one thing and one thing only: Readability. If you're really so worried about performance that you're willing to start stripping out encapsulated logic, you may as well just start coding in assembly.
Encapsulation also assists in debugging and feature adding. Consider the following: lets say you have a simple game and need to add code that depletes the players health under some circumstances. Easy, yes?
def DamagePlayer(dmg):
player.health -= dmg;
This IS very trivial code, so it's very tempting to simply scatter "player.health -=" everywhere. But what if later you want to add a powerup that halves damage done to the player while active? If the logic is still encapsulated, it's easy:
def DamagePlayer(dmg):
if player.hasCoolPowerUp:
player.health -= dmg / 2
else
player.health -= dmg
Now, consider if you had neglected to encapulate that bit of logic because of it's simplicity. Now you're looking at coding the same logic into 50 different places, at least one of which you are almost certain to forget, which leads to weird bugs like: "When player has powerup all damage is halved except when hit by AlienSheep enemies..."
Do you want to have problems with Alien Sheep? I don't think so. :)
In all seriousness, the point I'm trying to make is that encapsulation is a very good thing in the right circumstances. Of course, it's also easy to over-encapsulate, which can be just as problematic. Also, there are situations where the speed really truly does matter (though they are rare), and that extra few clock cycles is worth it. About the only way to find the right balance is practice. Don't shun encapsulation because it's slower, though. The benefits usually far outweigh the costs.

I don't know how good python compilers are, but the answer to this question for many languages is that the compiler will optimize calls to small procedures / functions / methods by inlining them. In fact, in some language implementations you generally get better performance by NOT trying to "micro-optimize" the code yourself.

What you are talking about is the effect of inlining functions for gaining efficiency.
It is certainly true in your Python example, that encapsulation hurts performance. But there are some counter example to it:
In Java, defining getter&setter instead of defining public member variables does not result in performance degradation as the JIT inline the getters&setters.
sometimes calling a function repetedly may be better than performing inlining as the code executed may then fit in the cache. Inlining may cause code explosion...

Figuring out what to make into a function and what to just include inline is something of an art. Many factors (performance, readability, maintainability) feed into the equation.
I actually find your example kind of silly in many ways - a function that just returns it's argument? Unless it's an overload that's changing the rules, it's stupid. A function to square things? Again, why bother. Your function 'foo' should probably return a string, so that it can be used directly:
''.join(foo(x)," -- bar !! "])
That's probably a more correct level of encapsulation in this example.
As I say, it really depends on the circumstances. Unfortunately, this is the sort of thing that doesn't lend itself well to examples.

IMO, this is related to Function Call Costs. Which are negligible usually, but not zero. Splitting the code in a lot of very small functions may hurt. Especially in interpreted languages where full optimization is not available.
Inlining functions may improve performance but it may also deteriorate. See, for example, C++ FQA Lite for explanations (“Inlining can make code faster by eliminating function call overhead, or slower by generating too much code, causing instruction cache misses”). This is not C++ specific. Better leave optimizations for compiler/interpreter unless they are really necessary.
By the way, I don't see a huge difference between two versions:
$ python bench.py
fine-grained function decomposition: 5.46632194519
one-liner: 4.46827578545
$ python --version
Python 2.5.2
I think this result is acceptable. See bench.py in the pastebin.

There is a performance hit for using functions, since there is the overhead of jumping to a new address, pushing registers on the stack, and returning at the end. This overhead is very small, however, and even in peformance critial systems worrying about such overhead is most likely premature optimization.
Many languages avoid these issues in small frequenty called functions by using inlining, which is essentially what you do above.
Python does not do inlining. The closest thing you can do is use macros to replace the function calls.
This sort of performance issue is better served by another language, if you need the sort of speed gained by inlining (mostly marginal, and sometimes detremental) then you need to consider not using python for whatever you are working on.

There is a good technical reason why what you suggested is impossible. In Python, functions, constants, and everything else are accessible at runtime and can be changed at any time if necessary; they could also be changed by outside modules. This is an explicit promise of Python and one would need some extremely important reasons to break it.
For example, here's the common logging idiom:
# beginning of the file xxx.py
log = lambda *x: None
def something():
...
log(...)
...
(here log does nothing), and then at some other module or at the interactive prompt:
import xxx
xxx.log = print
xxx.something()
As you see, here log is modified by completely different module --- or by the user --- so that logging now works. That would be impossible if log was optimized away.
Similarly, if an exception was to happen in make_legal_foo_string (this is possible, e.g. if x.__str__() is broken and returns None) you'd be hit with a source quote from a wrong line and even perhaps from a wrong file in your scenario.
There are some tools that in fact apply some optimizations to Python code, but I don't think of the kind you suggested.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.