Python Tail Recursion "Hack" using While Loop


I've seen a few examples of getting Python to do tail call optimization by using a while True loop. E.g.
def tail_exp(base, exponent, acc=1):
    if exponent == 0:
        return acc
    return tail_exp(base, exponent - 1, acc * base)
becomes
def tail_exp_2(base, exponent, acc=1):
    while True:
        if exponent == 0:
            return acc
        exponent, acc = exponent - 1, acc * base
I'm curious to know if this technique is applicable to all/most recursive algorithms in Python, and if there are any downsides or "gotchas" to look out for when optimizing recursive algorithms in this way?

Any recursive algorithm can be replaced by an iterative one. However, some examples will require an additional stack be added to the code, to manage state that is handled by the recursive calls in the original form. With tail recursion, there is no state to be managed, so no separate stack is needed.
Some programming languages take advantage of that fact and design their compilers to optimize out tail calls in recursive code, producing machine code that is equivalent to a loop. Python does not do tail call optimization, so this isn't really relevant to your question. Rewriting code by hand is not tail call optimization, it's just a particular sort of refactoring.
There are a few reasons Python chooses not to do tail call optimization. It's not because it's impossible. Python code is compiled into bytecode, so at least theoretically there's an opportunity to translate a recursive call into a loop if the developers wanted that. In practice it's a little more complicated: Python names are resolved dynamically, so you can't necessarily tell whether a function name will still refer to what you expect at runtime (a fact used by techniques like monkeypatching). However, the biggest problem with tail call optimization is that it generally discards useful debugging information that would otherwise be preserved by the call stack, like exactly how deep in the recursion you are and the exact state of the previous function calls. The Python developers have decided that they prefer the simplicity and debuggability of normal recursion over the performance benefits of tail call optimization, even when the latter is possible.
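The point about dynamic names can be made concrete with a small sketch. The names `countdown` and `replacement` are made up for illustration; the only thing being shown is that the recursive call is looked up by name each time it runs, so rebinding that name changes where the "tail call" goes.

```python
def countdown(n):
    """Tail-recursive in form: the last action is a call to `countdown`."""
    if n == 0:
        return "done"
    return countdown(n - 1)  # the name is looked up at call time


original = countdown


def replacement(n):
    """Stand-in for whatever a monkeypatch might install under the same name."""
    return "patched"


countdown = replacement  # rebind the global name

# The recursive call inside `original` now reaches `replacement`, so a
# compiler could not safely have turned that call into a jump back to itself.
result = original(3)
print(result)  # patched
```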
If you want to rewrite an algorithm from a recursive implementation into an iterative one, you can always do so. In some cases, though, it may get a lot more complicated. Recursive implementations of some algorithms can be a lot shorter, simpler, and easier to reason about, even though iterative equivalents may be faster (and won't hit the recursion limit for large inputs). Converting tail calls into a loop is usually quite simple though. The complicated cases are generally not amenable to tail call optimization either, since they're doing complicated stuff with the values returned by their recursion.
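As a rough illustration of the "additional stack" case, here is a sketch of a recursion that is not a tail call, next to a loop that manages its own stack. Flattening nested lists is just an example chosen here, and both function names are hypothetical.

```python
def flatten_recursive(items):
    """Not a tail call: each recursive result is merged into `out` afterwards."""
    out = []
    for item in items:
        if isinstance(item, list):
            out.extend(flatten_recursive(item))
        else:
            out.append(item)
    return out


def flatten_iterative(items):
    """Same traversal, with an explicit stack of iterators instead of call frames."""
    out = []
    stack = [iter(items)]
    while stack:
        try:
            item = next(stack[-1])
        except StopIteration:
            stack.pop()  # finished this nesting level
            continue
        if isinstance(item, list):
            stack.append(iter(item))  # descend, remembering where we were
        else:
            out.append(item)
    return out


nested = [1, [2, [3, 4]], 5]
print(flatten_recursive(nested))  # [1, 2, 3, 4, 5]
print(flatten_iterative(nested))  # [1, 2, 3, 4, 5]
```

The iterative version is noticeably longer, which is the trade-off described above, but its depth is limited by available memory rather than by the recursion limit.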

Related

Benefits to recursive function over using loops outside of readability? [duplicate]

I know that recursion is sometimes a lot cleaner than looping, and I'm not asking anything about when I should use recursion over iteration, I know there are lots of questions about that already.
What I'm asking is, is recursion ever faster than a loop? To me it seems like you would always be able to refine a loop and get it to perform more quickly than a recursive function, because the loop avoids constantly setting up new stack frames.
I'm specifically looking for whether recursion is faster in applications where recursion is the right way to handle the data, such as in some sorting functions, in binary trees, etc.

This depends on the language being used. You wrote 'language-agnostic', so I'll give some examples.
In Java, C, and Python, recursion is fairly expensive compared to iteration (in general) because it requires the allocation of a new stack frame. In some C compilers, one can use a compiler flag to eliminate this overhead, which transforms certain types of recursion (actually, certain types of tail calls) into jumps instead of function calls.
In functional programming language implementations, sometimes, iteration can be very expensive and recursion can be very cheap. In many, recursion is transformed into a simple jump, but changing the loop variable (which is mutable) sometimes requires some relatively heavy operations, especially on implementations which support multiple threads of execution. Mutation is expensive in some of these environments because of the interaction between the mutator and the garbage collector, if both might be running at the same time.
I know that in some Scheme implementations, recursion will generally be faster than looping.
In short, the answer depends on the code and the implementation. Use whatever style you prefer. If you're using a functional language, recursion might be faster. If you're using an imperative language, iteration is probably faster. In some environments, both methods will result in the same assembly being generated (put that in your pipe and smoke it).
Addendum: In some environments, the best alternative is neither recursion nor iteration but instead higher order functions. These include "map", "filter", and "reduce" (which is also called "fold"). Not only are these the preferred style, not only are they often cleaner, but in some environments these functions are the first (or only) to get a boost from automatic parallelization — so they can be significantly faster than either iteration or recursion. Data Parallel Haskell is an example of such an environment.
List comprehensions are another alternative, but these are usually just syntactic sugar for iteration, recursion, or higher order functions.
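For readers who think in Python, the three styles from the addendum look roughly like this. It is only a sketch of the syntax; it says nothing about parallelization, which plain CPython does not do for these functions.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# Higher-order functions: keep the even numbers, square them, then sum.
evens = filter(lambda n: n % 2 == 0, numbers)
squares = map(lambda n: n * n, evens)
total_hof = reduce(lambda acc, n: acc + n, squares, 0)

# The same computation as a generator expression fed to sum().
total_comp = sum(n * n for n in numbers if n % 2 == 0)

print(total_hof, total_comp)  # 56 56
```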

is recursion ever faster than a loop?
No, Iteration will always be faster than Recursion. (in a Von Neumann Architecture)
Explanation:
If you build the minimum operations of a generic computer from scratch, "Iteration" comes first as a building block and is less resource intensive than "recursion", ergo is faster.
Building a pseudo-computing-machine from scratch:
Question yourself: What do you need to compute a value, i.e. to follow an algorithm and reach a result?
We will establish a hierarchy of concepts, starting from scratch and defining in first place the basic, core concepts, then build second level concepts with those, and so on.
First Concept: Memory cells, storage, State. To do something you need places to store final and intermediate result values. Let’s assume we have an infinite array of "integer" cells, called Memory, M[0..Infinite].
Instructions: do something - transform a cell, change its value. alter state. Every interesting instruction performs a transformation. Basic instructions are:
a) Set & move memory cells
store a value into memory, e.g.: store 5 m[4]
copy a value to another position: e.g.: store m[4] m[8]
b) Logic and arithmetic
and, or, xor, not
add, sub, mul, div. e.g. add m[7] m[8]
An Executing Agent: a core in a modern CPU. An "agent" is something that can execute instructions. An Agent can also be a person following the algorithm on paper.
Order of steps: a sequence of instructions: i.e.: do this first, do this after, etc. An imperative sequence of instructions. Even one line expressions are "an imperative sequence of instructions". If you have an expression with a specific "order of evaluation" then you have steps. It means that even a single composed expression has implicit “steps” and also has an implicit local variable (let’s call it “result”). e.g.:
4 + 3 * 2 - 5
(- (+ (* 3 2) 4 ) 5)
(sub (add (mul 3 2) 4 ) 5)
The expression above implies 3 steps with an implicit "result" variable.
// pseudocode
1. result = (mul 3 2)
2. result = (add 4 result)
3. result = (sub result 5)
So even infix expressions, since you have a specific order of evaluation, are an imperative sequence of instructions. The expression implies a sequence of operations to be made in a specific order, and because there are steps, there is also an implicit "result" intermediate variable.
Instruction Pointer: If you have a sequence of steps, you have also an implicit "instruction pointer". The instruction pointer marks the next instruction, and advances after the instruction is read but before the instruction is executed.
In this pseudo-computing-machine, the Instruction Pointer is part of Memory. (Note: Normally the Instruction Pointer will be a “special register” in a CPU core, but here we will simplify the concepts and assume all data (registers included) are part of “Memory”)
Jump - Once you have an ordered number of steps and an Instruction Pointer, you can apply the "store" instruction to alter the value of the Instruction Pointer itself. We will call this specific use of the store instruction with a new name: Jump. We use a new name because it is easier to think about it as a new concept. By altering the instruction pointer we're instructing the agent to “go to step x“.
Infinite Iteration: By jumping back, now you can make the agent "repeat" a certain number of steps. At this point we have infinite Iteration.
1. mov 1000 m[30]
2. sub m[30] 1
3. jmp-to 2 // infinite loop
Conditional - Conditional execution of instructions. With the "conditional" clause, you can conditionally execute one of several instructions based on the current state (which can be set with a previous instruction).
Proper Iteration: Now with the conditional clause, we can escape the infinite loop of the jump back instruction. We have now a conditional loop and then proper Iteration
1. mov 1000 m[30]
2. sub m[30] 1
3. (if not-zero) jmp-to 2 // jump only if the previous
// sub instruction did not result in 0
// this loop will be repeated 1000 times
// here we have proper ***iteration***, a conditional loop.
Naming: giving names to a specific memory location holding data or holding a step. This is just a "convenience" to have. We do not add any new instructions by having the capacity to define “names” for memory locations. “Naming” is not an instruction for the agent, it’s just a convenience to us. Naming makes code (at this point) easier to read and easier to change.
#define counter m[30] // name a memory location
mov 1000 counter
loop: // name an instruction pointer location
sub counter 1
(if not-zero) jmp-to loop
One-level subroutine: Suppose there’s a series of steps you need to execute frequently. You can store the steps in a named position in memory and then jump to that position when you need to execute them (call). At the end of the sequence you'll need to return to the point of calling to continue execution. With this mechanism, you’re creating new instructions (subroutines) by composing core instructions.
Implementation: (no new concepts required)
Store the current Instruction Pointer in a predefined memory position
jump to the subroutine
at the end of the subroutine, you retrieve the Instruction Pointer from the predefined memory location, effectively jumping back to the following instruction of the original call
Problem with the one-level implementation: You cannot call another subroutine from a subroutine. If you do, you'll overwrite the returning address (global variable), so you cannot nest calls.
To have a better Implementation for subroutines: You need a STACK
Stack: You define a memory space to work as a "stack", you can “push” values on the stack, and also “pop” the last “pushed” value. To implement a stack you'll need a Stack Pointer (similar to the Instruction Pointer) which points to the actual “head” of the stack. When you “push” a value, the stack pointer decrements and you store the value. When you “pop”, you get the value at the actual Stack Pointer and then the Stack Pointer is incremented.
Subroutines Now that we have a stack we can implement proper subroutines allowing nested calls. The implementation is similar, but instead of storing the Instruction Pointer in a predefined memory position, we "push" the value of the IP in the stack. At the end of the subroutine, we just “pop” the value from the stack, effectively jumping back to the instruction after the original call. This implementation, having a “stack” allows calling a subroutine from another subroutine. With this implementation we can create several levels of abstraction when defining new instructions as subroutines, by using core instructions or other subroutines as building blocks.
Recursion: What happens when a subroutine calls itself?. This is called "recursion".
Problem: Overwriting the local intermediate results a subroutine can be storing in memory. Since you are calling/reusing the same steps, if the intermediate results are stored in predefined memory locations (global variables) they will be overwritten on the nested calls.
Solution: To allow recursion, subroutines should store local intermediate results in the stack, therefore, on each recursive call (direct or indirect) the intermediate results are stored in different memory locations.
...
having reached recursion we stop here.
Conclusion:
In a Von Neumann Architecture, clearly "Iteration" is a simpler/basic concept than “Recursion". We have a form of "Iteration" at level 7, while "Recursion" is at level 14 of the concepts hierarchy.
Iteration will always be faster in machine code because it implies fewer instructions and therefore fewer CPU cycles.
Which one is "better"?
You should use "iteration" when you are processing simple, sequential data structures, and everywhere a “simple loop” will do.
You should use "recursion" when you need to process a recursive data structure (I like to call them “Fractal Data Structures”), or when the recursive solution is clearly more “elegant”.
Advice: use the best tool for the job, but understand the inner workings of each tool in order to choose wisely.
Finally, note that you have plenty of opportunities to use recursion. You have Recursive Data Structures everywhere, you’re looking at one now: parts of the DOM supporting what you are reading are a RDS, a JSON expression is a RDS, the hierarchical file system in your computer is a RDS, i.e: you have a root directory, containing files and directories, every directory containing files and directories, every one of those directories containing files and directories...

Recursion may well be faster where the alternative is to explicitly manage a stack, like in the sorting or binary tree algorithms you mention.
I've had a case where rewriting a recursive algorithm in Java made it slower.
So the right approach is to first write it in the most natural way, only optimize if profiling shows it is critical, and then measure the supposed improvement.
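The answer above is about Java, but the "measure it" step looks much the same anywhere. As a minimal sketch in Python, using the standard `timeit` module (the two summing functions and the `measure` helper are hypothetical, and the input size is arbitrary):

```python
import timeit


def sum_recursive(n):
    """Sum 1..n with one call frame per step."""
    return 0 if n == 0 else n + sum_recursive(n - 1)


def sum_iterative(n):
    """Sum 1..n with a plain loop."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def measure(func, n=200, repeat=5, number=2000):
    """Best-of-`repeat` wall time in seconds for `number` calls of func(n)."""
    return min(timeit.repeat(lambda: func(n), repeat=repeat, number=number))


if __name__ == "__main__":
    print("recursive:", measure(sum_recursive))
    print("iterative:", measure(sum_iterative))
```

No timings are quoted here on purpose: the numbers depend on the interpreter, the machine, and the input size, which is exactly why measuring beats guessing.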

Tail recursion is as fast as looping. Many functional languages have tail recursion implemented in them.

Most of the answers here are wrong. The right answer is it depends. For example, here are two C functions which walk through a tree. First the recursive one:
static
void mm_scan_black(mm_rc *m, ptr p) {
    SET_COL(p, COL_BLACK);
    P_FOR_EACH_CHILD(p, {
        INC_RC(p_child);
        if (GET_COL(p_child) != COL_BLACK) {
            mm_scan_black(m, p_child);
        }
    });
}
And here is the same function implemented using iteration:
static
void mm_scan_black(mm_rc *m, ptr p) {
    stack *st = m->black_stack;
    SET_COL(p, COL_BLACK);
    st_push(st, p);
    while (st->used != 0) {
        p = st_pop(st);
        P_FOR_EACH_CHILD(p, {
            INC_RC(p_child);
            if (GET_COL(p_child) != COL_BLACK) {
                SET_COL(p_child, COL_BLACK);
                st_push(st, p_child);
            }
        });
    }
}
It's not important to understand the details of the code. Just that p are nodes and that P_FOR_EACH_CHILD does the walking. In the iterative version we need an explicit stack st onto which nodes are pushed and then popped and manipulated.
The recursive function runs much faster than the iterative one. The reason is because in the latter, for each item, a CALL to the function st_push is needed and then another to st_pop.
In the former, you only have the recursive CALL for each node.
Plus, accessing variables on the callstack is incredibly fast. It means you are reading from memory which is likely to always be in the innermost cache. An explicit stack, on the other hand, has to be backed by malloc:ed memory from the heap which is much slower to access.
With careful optimization, such as inlining st_push and st_pop, I can reach roughly parity with the recursive approach. But at least on my computer, the cost of accessing heap memory is bigger than the cost of the recursive call.
But this discussion is mostly moot because recursive tree walking is incorrect. If you have a large enough tree, you will run out of callstack space which is why an iterative algorithm must be used.
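The same stack-depth concern shows up in Python, where the limit is hit much sooner. Below is a loose Python analogue of the iterative version above, not a translation of it: the `children` mapping and the function name are assumptions made for this sketch, and a `seen` set plays the role of the black colour.

```python
def mark_reachable(children, root):
    """Return the set of nodes reachable from `root`, using an explicit stack.

    `children` is assumed to map each node to a list of its child nodes.
    """
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:  # mirrors the "not yet black" check
                seen.add(child)
                stack.append(child)
    return seen


# A degenerate "tree" that is one long chain: 0 -> 1 -> ... -> 99999.
chain = {i: [i + 1] for i in range(99_999)}
print(len(mark_reachable(chain, 0)))  # 100000
```

A recursive walk over the same chain would need about 100,000 nested calls, far beyond CPython's default recursion limit.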

Most answers here forget the obvious culprit why recursion is often slower than iterative solutions. It's linked with the build-up and tear-down of stack frames but is not exactly that. It's generally a big difference in where the automatic variables are stored for each recursion. In an iterative algorithm with a loop, the variables are often held in registers, and even if they spill, they will reside in the Level 1 cache. In a recursive algorithm, all intermediate states of the variables are stored on the stack, meaning they will generate many more spills to memory. This means that even if it performs the same number of operations, it will have a lot more memory accesses in the hot loop and, what makes it worse, these memory operations have a lousy reuse rate, making the caches less effective.
TL;DR recursive algorithms have generally a worse cache behavior than iterative ones.

Consider what absolutely must be done for each, iteration and recursion.
iteration: a jump to beginning of loop
recursion: a jump to beginning of called function
You see that there is not much room for differences here.
(I assume recursion being a tail-call and compiler being aware of that optimization).

In general, no, recursion will not be faster than a loop in any realistic usage that has viable implementations in both forms. I mean, sure, you could code up loops that take forever, but there would be better ways to implement the same loop that could outperform any implementation of the same problem via recursion.
You hit the nail on the head regarding the reason; creating and destroying stack frames is more expensive than a simple jump.
However, do note that I said "has viable implementations in both forms". For things like many sorting algorithms, there tends to not be a very viable way of implementing them that doesn't effectively set up its own version of a stack, due to the spawning of child "tasks" that are inherently part of the process. Thus, recursion may be just as fast as attempting to implement the algorithm via looping.
Edit: This answer is assuming non-functional languages, where most basic data types are mutable. It does not apply to functional languages.

In any realistic system, no, creating a stack frame will always be more expensive than an INC and a JMP. That's why really good compilers automatically transform tail recursion into a call to the same frame, i.e. without the overhead, so you get the more readable source version and the more efficient compiled version. A really, really good compiler should even be able to transform normal recursion into tail recursion where that is possible.

Functional programming is more about "what" rather than "how".
The language implementors will find a way to optimize how the code works underneath, if we don't try to make it more optimized than it needs to be. Recursion can also be optimized within the languages that support tail call optimization.
What matters more from a programmer's standpoint is readability and maintainability rather than optimization in the first place. Again, "premature optimization is the root of all evil".

This is a guess. Generally, recursion probably beats looping rarely, if ever, on problems of decent size when both use really good algorithms (not counting implementation difficulty). It may be different in a language with tail call optimization (given a tail-recursive algorithm, and with loops also part of the language); there the two would probably perform very similarly, and recursion might even be preferable some of the time.

According to theory, they are the same thing.
Recursion and a loop with the same O() complexity will run at the same theoretical speed, but of course real speed depends on the language, compiler and processor.
For example, raising a number to a power can be coded iteratively in O(log n):
int power(int t, int k) {
    int res = 1;
    while (k) {
        if (k & 1) res *= t;
        t *= t;
        k >>= 1;
    }
    return res;
}
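To see the "same complexity either way" claim side by side, here is a sketch of exponentiation by squaring written both ways in Python. Both function names are hypothetical, and both assume a non-negative integer exponent.

```python
def power_iterative(base, exponent):
    """Exponentiation by squaring with a loop: O(log exponent) multiplications."""
    result = 1
    while exponent:
        if exponent & 1:
            result *= base
        base *= base
        exponent >>= 1
    return result


def power_recursive(base, exponent):
    """Same algorithm written recursively; recursion depth is O(log exponent)."""
    if exponent == 0:
        return 1
    half = power_recursive(base, exponent // 2)
    return half * half * (base if exponent & 1 else 1)


print(power_iterative(3, 13), power_recursive(3, 13))  # 1594323 1594323
```

Because the depth here grows only logarithmically, the recursive form is in no danger of hitting a stack limit for any practical exponent.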

Here is an example where recursion ran faster than a for loop in Java. This is a program which performs Bubble Sort on two arrays. The recBubbleSort(....) method sorts the array arr using recursion, and the bbSort(....) method just uses looping to sort the array narr. The data are the same in both arrays.
public class BBSort_App {
    public static void main(String args[]) {
        int[] arr = {231,414235,23,543,245,6,324,-32552,-4};
        long time = System.nanoTime();
        recBubbleSort(arr, arr.length-1, 0);
        time = System.nanoTime() - time;
        System.out.println("Time Elapsed: "+time+"nanos");
        disp(arr);
        int[] narr = {231,414235,23,543,245,6,324,-32552,-4};
        time = System.nanoTime();
        bbSort(narr);
        time = System.nanoTime()-time;
        System.out.println("Time Elapsed: "+time+"nanos");
        disp(narr);
    }
    static void disp(int[] origin) {
        System.out.print("[");
        for(int b: origin)
            System.out.print(b+", ");
        System.out.println("\b\b \b]");
    }
    static void recBubbleSort(int[] origin, int i, int j) {
        if(i>0)
            if(j!=i) {
                if(origin[i]<origin[j]) {
                    int temp = origin[i];
                    origin[i] = origin[j];
                    origin[j] = temp;
                }
                recBubbleSort(origin, i, j+1);
            }
            else
                recBubbleSort(origin, i-1, 0);
    }
    static void bbSort(int[] origin) {
        for(int out=origin.length-1;out>0;out--)
            for(int in=0;in<out;in++)
                if(origin[out]<origin[in]) {
                    int temp = origin[out];
                    origin[out] = origin[in];
                    origin[in] = temp;
                }
    }
}
Running the test even 50 times gave almost the same results:
The answers given to this question are satisfactory but lack simple examples. Can anybody explain why this recursion runs faster?

Is it bad practice to use Recursion where it isn't necessary?

In one of my last assignments, I got points docked for using recursion where it was not necessary. Is it bad practice to use Recursion where you don't have to?
For instance, this Python code block could be written two ways:
def test():
    if(foo() == 'success!'):
        print(True)
    else:
        test()
or
def test():
    while(True):
        if(foo() == 'success!'):
            print(True)
            break
Is one inherently better than another? (Performance-wise or Practice-wise)?

While recursion may allow for an easier-to-read solution, there is a cost. Each function call requires a small amount of memory overhead and set-up time that a loop iteration does not.
The Python call stack is also limited to 1000 nested calls by default; each recursive call counts against that limit, so you risk raising a run-time error with any recursive algorithm. There is no such hard limit to the number of iterations a loop may make.
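The limit is easy to observe directly. This is a minimal sketch, assuming CPython and a recursion limit that has not been raised to some very large value; `depth` is a throwaway function invented for the demonstration.

```python
import sys


def depth(n):
    """Recurse n levels deep and return n; purely to exercise the call stack."""
    return 0 if n == 0 else 1 + depth(n - 1)


limit = sys.getrecursionlimit()  # typically 1000 in CPython unless changed

try:
    depth(limit + 100)
    hit_limit = False
except RecursionError:
    hit_limit = True

# A loop doing the same number of steps has no such ceiling.
count = 0
for _ in range(limit + 100):
    count += 1

print(limit, hit_limit, count)
```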

They're not the same. The iterative version can in theory run forever; once it is entered, it doesn't change the state of the Python virtual machine. The recursive version, however, keeps expanding the call stack as it keeps going. As @chepner mentions in his answer, there's a limit to how long you can keep that up.
For the example you give, you'll notice the difference quickly! As foo never changes, when foo() != 'success!' the recursive version will raise an exception once you blow out the stack (which won't take long), but the iterative version will "hang". For other functions that actually terminate, between two implementations of the same algorithm, one recursive and the other iterative, the iterative version will usually outperform the recursive one, as function calls impose some overhead.
Generally, recursions should bottom out -- typically there's a simplest case they handle outright without further recursive calls (n = 0, list or tuple or dict is empty, etc.) For more complex inputs, recursive calls work on constituent parts of the input (elements of a list, items of a dict, ...), returning solutions to their subproblems which the calling instance combines in some way and returns.
Recursion is analogous, and in many ways related, to mathematical induction -- more generally, well-founded induction. You reason about the correctness of a recursive procedure using induction, as the arguments passed recursively are generally "smaller" than those passed to the caller.
In your example, recursion is pointless: you're not passing data to a nested call, it's not a case of divide-and-conquer, there's no base case at all. It's the wrong mechanism for "infinite"/unbounded loops.

Using recursive function instead of loops [closed]

I’m trying to build a scoring matrix using recursion instead of loops like these:
for i in range(1, len(str1)):
    for j in range(1, len(str2)):
        #do something
my code:
def matrix_bulid(index_1, index_2):
    print(index_1, index_2)
    if index_1 == len(first_dna) and index_2 == len(second_dna):
        return
    elif index_2 == len(second_dna):
        return matrix_bulid(index_1 + 1, 1)
    else:
        #do something
        matrix_bulid(index_1, index_2 + 1)
But on really long strings, I get a maximum recursion depth error.
Does anyone have any idea how to do it?

If your goal is to turn your nested for loop into a simple recursive function, you've done that successfully. There are some bugs (bad indentation, not returning anything in the second recursive case, …), but the basic structure is sound.
Unfortunately, recursive functions are bounded by the depth of the stack. Some languages have tail call elimination, which means tail-recursive functions aren't bounded. And your implementation is tail-recursive. But Python (intentionally) does not have that feature, so that doesn't matter; you can still only deal with strings up to sys.getrecursionlimit()+1 in length.
If your strings are bounded, but just a little too big for the default recursion limit (which is 1000 in CPython), you can use sys.setrecursionlimit to set it higher.
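Raising the limit might look like the sketch below. The headroom of 2000 is an arbitrary figure picked for illustration, and `count_down` is a hypothetical stand-in for a real recursive function.

```python
import sys


def count_down(n):
    """Tail-recursive in form, but CPython still uses one frame per call."""
    return 0 if n == 0 else count_down(n - 1)


old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(old_limit + 2000)  # arbitrary headroom for this sketch
try:
    result = count_down(old_limit + 500)  # would fail under the old limit
finally:
    sys.setrecursionlimit(old_limit)  # restore, so other code is unaffected

print(result)  # 0
```

Setting the limit far higher than the platform can support can crash the interpreter instead of raising an exception, so this only buys a bounded amount of extra depth.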
There are also tricks to simulate tail call elimination in Python, which I'll discuss below. But even with the best possible trick, your recursive code is still longer, less obvious, less Pythonic, and slower than your nested loop.
If you're doing this for a homework assignment, you're probably done. Unless you were given a choice of programming language, in which case you will need to choose a language with tail call elimination. If you're doing this for real code, you should stick with your original nested loops—the code is simpler, it will run faster, and there is no stack limit, without needing anything hacky or complicated.
The most efficient way to implement tail-call optimization in CPython is by modifying the bytecode of compiled functions, as here. But that's also the hackiest way, and it only works in a restricted set of circumstances.
The simplest way is to modify your function to return a function that returns a value, instead of a value, and then use a trampoline to chain the evaluations together (there are at least two blog posts that show this technique, here and here). But, while that's simplest to implement, it requires changes through your actual recursive functions, which make them more complicated and less readable.
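A bare-bones version of that thunk-returning approach could look like this. It is a sketch, not the code from the linked posts: `trampoline` and `tail_exp_thunked` are names invented here, and the exponent function is borrowed from the first question on this page.

```python
def trampoline(func, *args):
    """Call func, then keep calling whatever it returns until it is not callable."""
    result = func(*args)
    while callable(result):
        result = result()
    return result


def tail_exp_thunked(base, exponent, acc=1):
    """Like tail_exp, but returns a zero-argument function instead of recursing."""
    if exponent == 0:
        return acc
    return lambda: tail_exp_thunked(base, exponent - 1, acc * base)


print(trampoline(tail_exp_thunked, 2, 10))       # 1024
print(trampoline(tail_exp_thunked, 1, 100_000))  # 1
```

One known weakness of this simple form: if the function's real result is itself callable, the trampoline will call it by mistake, so more careful versions wrap the pending call in a dedicated marker type.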
The best tradeoff is probably to use a decorator that inserts a trampoline in the middle of each recursive step, so you don't need to delay function calls explicitly, as seen here. This can get a bit tricky if you have mutually-recursive functions, but it's rarely hard to figure out. And the implementation is pretty easy to understand. And it's not much slower than the bytecode hackery.

Is Python allowed to optimize a function definition to eliminate unused code?

If I defined a function like this:
def ccid_year(seq):
year, prefix, index, suffix = seq
return year
Is Python allowed to optimize it to be effectively:
def ccid_year(seq):
return seq[0]
I'd prefer to write the first function because it documents the format of the data being passed in but would hope that Python would generate code that is effectively as efficient as the second definition.

The two functions are not equivalent:
def ccid_year_1(seq):
    year, prefix, index, suffix = seq
    return year

def ccid_year_2(seq):
    return seq[0]

# Unpacking a dict yields its keys, while indexing looks up a key.
arg = {0: 'c', 1: 'a', 2: 'b', 3: 'd'}
print(ccid_year_1(arg))
print(ccid_year_2(arg))
The first call prints 0 and the second prints c.

I'll answer the question at face value later, but first: When in doubt, benchmark it! But first, recall that most time is spent in a small portion of the code (i.e., most code is irrelevant to performance!) and, in CPython, function call overhead usually dominates small inefficiencies. Not to mention that large-scale algorithmic inefficiencies (a.k.a. freaking stupid code) dwarf micro-optimization concerns.
So either don't worry about this at all, or if you have reason to worry about it, first benchmark alternatives and second don't put it in a function. Note that "reasons to worry about it" must be weighted against the time spent worrying, and the maintenance burden (if there is one) of the manual optimization.
CPython, the reference implementation you most likely use, is very conservative about optimizing at this level. While there is a peephole optimizer operating on bytecode, it is limited in scale. More generally, you can't expect much optimization across more than a single statement. The problem with statically optimizing Python code is that there are a billion ways even the most innocent-looking program fragment can call into arbitrary code, which might do anything at all, so you can't omit these calls.
While we're at it, your proposed optimization is invalid (in the sense that the program doesn't have the same behavior) if seq is of the wrong type (not a sequence, or a very weird sequence) or length (not exactly four items long)! Any program claiming to implement Python must maintain such differences, so it won't do the transformation you suggest literally. I assume this was just an off-hand illustration, but it does indicate you seriously underestimate how complex Python is (to implement, and doubly so to optimize). I and others have written about this at length before, so I'll stop now before this post becomes even larger.
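One way to check what CPython actually does with the function is to look at its bytecode with the standard `dis` module. This is a sketch assuming CPython; exact opcode names vary between versions, though the unpack has appeared as `UNPACK_SEQUENCE` in the versions I know of.

```python
import dis


def ccid_year(seq):
    year, prefix, index, suffix = seq
    return year


# List the bytecode operation names CPython generated for the function.
opnames = [ins.opname for ins in dis.get_instructions(ccid_year)]
print(opnames)

# The unpack is still performed, so a three-item input is rejected
# rather than silently treated like seq[0].
try:
    ccid_year((2024, "a", 1))
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```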
PyPy on the other hand will, if this function is indeed called from a hot loop, probably optimize this and a million other things you didn't even think of, while compiling it down to a machine code loop that iterates faster than any Python loop could ever iterate on CPython. It will still contain a few checks to break out of the loop and take the proper action (e.g. raise an exception) if necessary, but they'll also be highly efficient if not triggered.
I do not know much about IronPython and Jython and other implementations, but if their lack of consistent several-times-faster-than-CPython benchmark results is any indicator, they do not perform significant optimizations. While the VMs IronPython and Jython run on include JIT compilers (almost, but not quite, entirely unlike PyPy's), these JIT compilers are built for very different languages, and I'd be very surprised if they could look through the mess of code IronPython/Jython must execute to achieve Python semantics and perform such optimizations on it.

Explain to me what the big deal with tail-call optimization is and why Python needs it

Apparently, there's been a big brouhaha over whether or not Python needs tail-call optimization (TCO). This came to a head when someone shipped Guido a copy of SICP, because he didn't "get it." I'm in the same boat as Guido. I understand the concept of tail-call optimization. I just can't think of any reason why Python really needs it.
To make this easier for me to understand, what would be a snippet of code that would be greatly simplified using TCO?

Personally, I put great value on tail call optimization; but mainly because it makes recursion as efficient as iteration (or makes iteration a subset of recursion). In minimalistic languages you get huge expressive power without sacrificing performance.
In a 'practical' language (like Python), OTOH, you usually have a lot of other constructions for almost every situation imaginable, so it's less critical. It is always a good thing to have, to allow for unforeseen situations, of course.

If you intensely want to use recursion for things that might alternatively be expressed as loops, then "tail call optimization" is really a must. However, Guido, Python's Benevolent Dictator For Life (BDFL), strongly believes in loops being expressed as loops -- so he's not going to special-case tail calls (sacrificing stack-trace dumps and debugging regularity).

Tail-call optimization makes it easier to write recursive functions without worrying about a stack overflow:
def fac(n, result=1):
    if n > 1:
        return fac(n - 1, n * result)
    return result
Without tail-call optimization, calling this with a big number could overflow the stack.

Guido recognized in a follow-up post that TCO allows a cleaner implementation of a state machine as a collection of functions recursively calling each other. However, in the same post he proposes an alternative, equally clean solution without TCO.
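The post's own code isn't reproduced here, so the following is only one common way to write such a machine without TCO, not necessarily the one Guido proposed: each state function returns the next state instead of calling it, and a plain loop does the dispatch. The word-counting task and all names are invented for this sketch.

```python
def between_words(ctx):
    """State: skipping whitespace. Returns the next state, or None to stop."""
    if ctx["pos"] >= len(ctx["text"]):
        return None
    ch = ctx["text"][ctx["pos"]]
    if ch.isspace():
        ctx["pos"] += 1
        return between_words
    ctx["words"] += 1  # a new word starts here
    return inside_word


def inside_word(ctx):
    """State: consuming the characters of a word."""
    if ctx["pos"] >= len(ctx["text"]):
        return None
    ch = ctx["text"][ctx["pos"]]
    ctx["pos"] += 1
    return between_words if ch.isspace() else inside_word


def count_words(text):
    """Drive the machine with a loop, so the stack depth stays constant."""
    ctx = {"text": text, "pos": 0, "words": 0}
    state = between_words
    while state is not None:
        state = state(ctx)
    return ctx["words"]


print(count_words("tail calls  are not optimized"))  # 5
```

With mutually recursive state functions, the same input would need one stack frame per character; here the depth never exceeds the driver loop plus one state call.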
