Is Haskell's laziness an elegant alternative to Python's generators? - python

In a programming exercise, it was first asked to program the factorial function and then calculate the sum 1! + 2! + 3! + ... + n! in O(n) multiplications (so we can't use the factorial directly). I am not searching for the solution to this specific (trivial) problem; I'm trying to explore Haskell's abilities, and this problem is a toy I would like to play with.
I thought Python's generators could be a nice solution to this problem. For example:
from itertools import islice

def ifact():
    i, f = 1, 1
    yield 1
    while True:
        f *= i
        i += 1
        yield f

def sum_fact(n):
    return sum(islice(ifact(), n))
Then I tried to figure out whether there is something in Haskell with behavior similar to this generator, and I thought that laziness does all the stuff without any additional concept.
For example, we could replace my Python ifact with
fact = scanl1 (*) [1..]
And then solve the exercise with the following:
sum n = foldl1 (+) (take n fact)
I wonder if this solution is really "equivalent" to the Python one regarding time complexity and memory usage. I would say that Haskell's solution never stores the whole list fact, since its elements are used only once.
Am I right or totally wrong?
EDIT :
I should have checked more carefully:
Prelude> foldl1 (+) (take 4 fact)
33
Prelude> :sprint fact
fact = 1 : 2 : 6 : 24 : _
So (my implementation of) Haskell stores the result, even though it's no longer used.

Indeed, lazy lists can be used this way. There are some subtle differences though:
Lists are data structures. So you can keep them after evaluating them, which can be both good and bad (you avoid recomputation of values and can do recursive tricks as @ChrisDrost described, at the cost of keeping memory unreleased).
Lists are pure. In generators you can have computations with side effects; you can't do that with lists (which is often desirable).
Since Haskell is a lazy language, laziness is everywhere, and if you just convert a program from an imperative language to Haskell, the memory requirements can change considerably (as @RomanL describes in his answer).
But Haskell offers more advanced tools to accomplish the generator/consumer pattern. Currently there are three libraries that focus on this problem: pipes, conduit and iteratees. My favorite is conduit, it's easy to use and the complexity of its types is kept low.
They have several advantages, in particular that you can create complex pipelines and you can base them on a chosen monad, which allows you to say what side effects are allowed in a pipeline.
Using conduit, your example could be expressed as follows:
import Data.Functor.Identity
import Data.Conduit
import qualified Data.Conduit.List as C

ifactC :: (Num a, Monad m) => Producer m a
ifactC = loop 1 1
  where
    loop r n = let r' = r * n
               in yield r' >> loop r' (n + 1)

sumC :: (Num a, Monad m) => Consumer a m a
sumC = C.fold (+) 0

main :: IO ()
main = (print . runIdentity) (ifactC $= C.isolate 5 $$ sumC)
-- alternatively running the pipeline in IO monad directly:
-- main = (ifactC $= C.isolate 5 $$ sumC) >>= print
Here we create a Producer (a conduit that consumes no input) that yields factorials indefinitely. Then we compose it with isolate, which ensures that no more than a given number of values are propagated through it, and then we compose it with a Consumer that just sums values and returns the result.

Your examples are not equivalent in memory usage. It is easy to see if you replace * with + (so that the numbers don't get big too quickly) and then run both examples on a big n such as 10^7. Your Haskell version will consume a lot of memory while the Python one will keep it low.
The Python generator will not generate a list of values and then sum it up. Instead, the sum function will get values one by one from the generator and accumulate them. Thus, the memory usage will remain constant.
Haskell will evaluate functions lazily, but in order to calculate, say, foldl1 (+) (take n fact), it will have to evaluate the complete expression. For large n this will unfold into a huge expression, the same way (foldl (+) 0 [0..n]) does. For more details on evaluation and reduction have a look here: https://www.haskell.org/haskellwiki/Foldr_Foldl_Foldl%27.
You can fix your sum n by using foldl1' instead of foldl1, as described at the link above. As @user2407038 explained in his comment, you'd also need to keep fact local. The following works in GHCi with constant memory use:
let notfact = scanl1 (+) [1..]
let n = 20000000
let res = foldl' (+) 0 (take n notfact)
Note that in the case of the actual factorial in place of notfact, memory considerations are less of a concern. The numbers get big quickly and arbitrary-precision arithmetic slows things down, so you won't be able to get to values of n big enough to actually see the difference.

Basically, yes: Haskell's lazy lists are a lot like Python's generators, if those generators were effortlessly cloneable, cacheable, and composable. Instead of raising StopIteration you return [] from your recursive function, which can thread state into the generator.
They do some cooler stuff due to self-recursion. For example, your factorial generator is more idiomatically written as:
facts = 1 : zipWith (*) facts [1..]
or the Fibonaccis as:
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
In general any iterative loop can be converted to a recursive algorithm by promoting the loop state to arguments of a function and then calling it recursively to get the next loop cycle. Generators are just like that, but we prepend some elements on each iteration of the recursive function: go ____ = (stuff) : go ____.
The perfect equivalent is therefore:
ifact :: [Integer]
ifact = go 1 1
  where go f i = f : go (f * i) (i + 1)

sum_fact n = sum (take n ifact)
In terms of what's fastest, the absolute fastest in Haskell will probably be the "for loop":
sum_fact n = go 1 1 1
  where go acc fact i
          | i <= n    = go (acc + fact) (fact * i) (i + 1)
          | otherwise = acc
The fact that this is "tail-recursive" (a call of go does not pipe any sub-calls to go to another function like (+) or (*)) means that the compiler can package it into a really tight loop, and that's why I'm comparing it with "for loops" even though that's not really a native idea to Haskell.
The above sum_fact n = sum (take n ifact) is a little slower than this but faster than sum (take n facts) where facts is defined with zipWith. The speed differences are not very large and I think mostly just come down to memory allocations that don't get used again.

Why `s = [x for x in "hello"]; sum(s)` doesn't work in Python [duplicate]

Python has a built-in function sum, which is effectively equivalent to:
import operator
from functools import reduce  # in Python 2, reduce is a builtin

def sum2(iterable, start=0):
    return start + reduce(operator.add, iterable)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) = sum2({888:1},0) = 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place "forever", i.e., since sum was added as a built-in function (rev. 32347).
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..
>>> class ZeroObject(object):
...     def __add__(self, other):
...         return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
/* reject string values for 'start' parameter */
if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
    PyErr_SetString(PyExc_TypeError,
        "sum() can't sum strings [use ''.join(seq) instead]");
    Py_DECREF(iter);
    return NULL;
}
Py_INCREF(result);
So.. that's your answer.
It's explicitly checked in the code and rejected.
From the docs:
The preferred, fast way to concatenate a
sequence of strings is by calling
''.join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitrary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
The Reason Why
@dan04 has an excellent explanation of the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, while not many use sum for lists and tuples and other containers where repeated concatenation is O(n**2). The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck typing in this instance and not allow strings to be used with sum.
Edit: Moved the parts about immutability to history.
Basically, it's a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similar to a reduce statement, the code generated looks something like
v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len("a") + len("b")
...
res = v10000 + "$" # must allocate res and set its size to len(v9999) + len("$")
In each of these steps a new string is created, which for one might give some copying overhead as the strings are getting longer and longer. But that’s maybe not the point here. What’s more important, is that every new string on each line must be allocated to it’s specific size (which. I don’t know it it must allocate in every iteration of the reduce statement, there might be some obvious heuristics to use and Python might allocate a bit more here and there for reuse – but at several points the new string will be large enough that this won’t help anymore and Python must allocate again, which is rather expensive.
A dedicated method like join, however has the job to figure out the real size of the string before it starts and would therefore in theory only allocate once, at the beginning and then just fill that new string, which is much cheaper than the other solution.
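To make that concrete, here is a toy Python sketch of the measure-then-fill idea (join_like is a made-up name; CPython's real str.join is implemented in C and works directly on a character buffer): it computes the final size in one pass, allocates a single buffer, and copies each piece into place.
def join_like(strings):
    # Sketch only: measure first, then fill a single preallocated buffer.
    pieces = [s.encode("utf-8") for s in strings]
    total = sum(len(p) for p in pieces)   # final size known before building
    buf = bytearray(total)                # one allocation instead of one per step
    pos = 0
    for p in pieces:
        buf[pos:pos + len(p)] = p         # copy each piece into place
        pos += len(p)
    return buf.decode("utf-8")

assert join_like(["foo", "bar"]) == "foobar"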
I don't know why, but this works!
import operator
from functools import reduce  # in Python 2, reduce is a builtin

def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)

Reducing the time complexity of stair-step question (Amazon interview question)

def step(n):
    if (n == 0) or (n == 1):
        return 1
    elif n == 2:
        return 2
    else:
        return step(n-1) + step(n-2) + step(n-3)

n = int(input())
print(step(n))
For input 53798080 it is taking 1 second. It should take a lot less than that to satisfy the test case.
This type of problem - evaluating a recurrence relation - has had a lot of smart people study it over the years, and that means that there's a ton of cool insights and ideas you can use to speed things up.
The comments have done a great job identifying why your code slows down on large inputs - it's because you're generating lots of duplicate recursive calls. The question, then, is how to address this.
If you want to keep your same basic strategy, I would recommend using memoization. If you haven't seen this technique before, the basic idea is to have the recursion keep track of calls that have already been made and to cache the results of those calls. Then, if you try solving the same problem twice, you can just hand back the cached result.
The general template for memoization looks something like this. (It's in pseudocode, but shouldn't be too hard to adapt.)
def memoized_recursion(original_args, memoization_table):
    if memoization_table contains original_args:
        return memoization_table[original_args]
    else:
        # Put the rest of your recursive code here.
        # Before returning a result, store it in memoization_table.
This reduces the number of recursive calls pretty dramatically, speeding up your code.
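Adapted to the step function from the question, a dictionary-based version might look like the following sketch (step_memo and the memo argument are made-up names, distinct from the lru_cache version shown further down):
def step_memo(n, memo=None):
    # Cache each result so every value of n is computed at most once.
    if memo is None:
        memo = {}
    if n in memo:
        return memo[n]
    if n == 0 or n == 1:
        result = 1
    elif n == 2:
        result = 2
    else:
        result = (step_memo(n - 1, memo) +
                  step_memo(n - 2, memo) +
                  step_memo(n - 3, memo))
    memo[n] = result
    return result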
This, of course, isn't the only solution to making your code faster. If you have to keep things recursive, there's a different insight you can use that fundamentally changes the strategy. The basic idea is this. You're generating a series of numbers that looks like this:
1, 1, 2, 4, 7, 13, 24, ...
The idea is that
the first three terms are 1, 1, 2;
every term past this one is the sum of the three previous numbers; and
you want the nth term of the series.
If you need terms 0, 1, or 2, you can just read the answer off because you know the first three numbers.
If not, here's another technique you can use. Rather than getting the three previous values and adding them together, use this useful fact: asking for the nth term of the series starting with 1, 1, 2 is equivalent to asking for the (n-1)st term of the series starting with 1, 2, 4. (Do you see why?)
More generally, if the first three terms of the series are a, b, and c and you want the nth term, you can ask for the (n-1)st term of the series starting with the sequence b, c, a + b + c. This gives a different recursive strategy where the recursion doesn't branch, meaning that you don't need memoization.
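A minimal Python sketch of this shifted-series idea (step_shift is a made-up name); the recursion carries the current three-term window instead of branching:
def step_shift(n, a=1, b=1, c=2):
    # Term n of the series starting (a, b, c) equals
    # term n-1 of the series starting (b, c, a+b+c).
    if n == 0:
        return a
    return step_shift(n - 1, b, c, a + b + c)

# step_shift(4) == 7, matching the series 1, 1, 2, 4, 7, 13, 24, ...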
And now, one final strategy. The type of problem you're solving involves something called a homogeneous linear recurrence relation. That is, you have a recurrence where
a_0, a_1, ..., a_{k-1} are fixed constants, and
a_{n+k} = c_0·a_n + c_1·a_{n+1} + ... + c_{k-1}·a_{n+k-1}.
This recurrence includes things like the Fibonacci sequence, the Pell numbers, the Padovan sequence, etc.
It turns out that in any case where you're solving a recurrence like this, you can solve the problem by raising specifically-chosen matrices to specific powers. In your case, the basic idea is related to the one for the second recursive strategy. The idea is that if the last three terms of the sequence are a, b, and c, then you know that the next term is a + b + c, and the two terms before this are b and c. In other words, you can think of a mapping that turns (a, b, c) into (b, c, a + b + c). This can be thought of as this matrix equation:
| 0 1 0 |   | a |   | b         |
| 0 0 1 | × | b | = | c         |
| 1 1 1 |   | c |   | a + b + c |
If you let M be the matrix on the far left, then computing M^n and multiplying it by the column vector (a, b, c) will give you the nth, (n+1)st, and (n+2)nd terms of the recurrence relation. This gives a totally different strategy for solving the problem: build a matrix, then raise it to a large power!
You can do this very efficiently, in fact. There's a (recursive) technique called exponentiation by squaring that can compute the nth power of a matrix using only O(log n) multiplications. (The entries of the matrix will start to get pretty big, and multiplying them will start to be your bottleneck, unfortunately). It might be worth checking out this strategy, though, since it's a pretty cool technique!
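Here is a rough Python sketch of the matrix approach (mat_mult, mat_pow and step_matrix are made-up helper names; it assumes the first three terms are 1, 1, 2 as above):
def mat_mult(A, B):
    # Multiply two 3x3 matrices of Python integers.
    return [[sum(A[i][t] * B[t][j] for t in range(3)) for j in range(3)]
            for i in range(3)]

def mat_pow(M, p):
    # Exponentiation by squaring: M**p in O(log p) matrix multiplications.
    result = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # identity matrix
    while p:
        if p & 1:
            result = mat_mult(result, M)
        M = mat_mult(M, M)
        p >>= 1
    return result

def step_matrix(n):
    # M**n applied to the column vector (1, 1, 2) yields terms n, n+1, n+2
    # of the series; the first component is the one we want.
    if n < 3:
        return (1, 1, 2)[n]
    M = [[0, 1, 0], [0, 0, 1], [1, 1, 1]]
    row = mat_pow(M, n)[0]
    return row[0] * 1 + row[1] * 1 + row[2] * 2

# step_matrix(6) == 24, matching the series 1, 1, 2, 4, 7, 13, 24, ...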
And, finally, one last option. If you do some Googling, you'll find that your problem is closely related to finding the nth tribonacci number. There are some cool formulas you can use to compute this directly, also involving powers of numbers, though they might introduce some rounding errors that slow things down a bit too much for your purposes.
For input 53798080 it is taking 1 second.
I highly doubt this. Your code stack overflows on this input. What I believe is going on here is that it takes 1 second for input 30 to produce the output 53798080. By input 31, we're up to nearly half a minute.
If we memoize your code:
from functools import lru_cache

@lru_cache
def step(n):
    if n == 0 or n == 1:
        return 1
    if n == 2:
        return 2
    return step(n-1) + step(n-2) + step(n-3)
It fixes the speed problem, as @templatetypedef explains. But it blows up with a stack overflow (assuming you don't allocate more stack) above input 500. We can double that range, and deal with the speed problem sans memoization, using a more efficient algorithm with fewer recursions:
def step(n, prev1=2, prev2=1, prev3=1):
    if 0 <= n <= 1:
        return 1
    if n == 2:
        return prev1
    return step(n - 1, prev3 + prev2 + prev1, prev1, prev2)
This will handle input up to 999 and produce a result in a fraction of a second:
> time python3 test.py
1499952522327196729941271196334368245775697491582778125787566254148069690528296568742385996324542810615783529390195412125034236407070760756549390960727215226685972723347839892057807887049540341540394345570010550821354375819311674972209464069786275283520364029575324
0.032u 0.011s 0:00.04 100.0% 0+0k 0+0io 0pf+0w
>
(Adding @lru_cache to this code will reduce the input range back to the original and make no difference speed-wise.)

What is optimal algorithm to check if a given integer is equal to sum of two elements of an int array?

import numpy as np

def check_set(S, k):
    S2 = k - S
    set_from_S2 = set(S2.flatten())
    for x in S:
        if x in set_from_S2:
            return True
    return False
I have a given integer k. I want to check whether k is equal to the sum of two elements of the array S.
S = np.array([1,2,3,4])
k = 8
It should return False in this case because there are no two elements of S with a sum of 8. The above code treats 8 = 4 + 4 (using the same element twice), so it returned True.
I can't find an algorithm to solve this problem with complexity of O(n).
Can someone help me?
You have to account for multiple instances of the same item, so a set is not a good choice here.
Instead you can use a dictionary mapping each value to its number of occurrences (or, as a variant, collections.Counter).
A = [3, 1, 2, 3, 4]
Cntr = {}
for x in A:
    if x in Cntr:
        Cntr[x] += 1
    else:
        Cntr[x] = 1

#k = 11
k = 8

ans = False
for x in A:
    if (k - x) in Cntr:
        if k == 2 * x:
            if Cntr[k - x] > 1:
                ans = True
                break
        else:
            ans = True
            break

print(ans)
Returns True for k=5,6 (I added one more 3) and False for k=8,11
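For reference, the Counter variant mentioned above might look like this sketch (has_pair_sum is a made-up name):
from collections import Counter

def has_pair_sum(A, k):
    # True if two elements at distinct indices of A sum to k.
    cnt = Counter(A)
    for x in A:
        need = k - x
        if need in cnt and (need != x or cnt[x] > 1):
            return True
    return False

print(has_pair_sum([3, 1, 2, 3, 4], 6))   # True  (3 + 3, or 2 + 4)
print(has_pair_sum([3, 1, 2, 3, 4], 8))   # False (would need two 4s)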
Adding onto MBo's answer.
"Optimal" can be an ambiguous term in terms of algorithmics, as there is often a compromise between how fast the algorithm runs and how memory-efficient it is. Sometimes we may also be interested in either worst-case resource consumption or in average resource consumption. We'll loop at worst-case here because it's simpler and roughly equivalent to average in our scenario.
Let's call n the length of our array, and let's consider 3 examples.
Example 1
We start with a very naive algorithm for our problem, with two nested loops that iterate over the array, and check for every two items of different indices if they sum to the target number.
Time complexity: the worst-case scenario (where the answer is False, or where it's True but we only find it on the last pair of items we check) has n^2 loop iterations. If you're familiar with the big-O notation, we'll say the algorithm's time complexity is O(n^2), which basically means that in terms of our input size n, the time it takes to run the algorithm grows more or less like n^2, up to a multiplicative factor (well, technically the notation means "at most like n^2 up to a multiplicative factor", but it's a generalized abuse of language to use it as "more or less like" instead).
Space complexity (memory consumption): we only store an array, plus a fixed set of objects whose sizes do not depend on n (everything Python needs to run, the call stack, maybe two iterators and/or some temporary variables). The part of the memory consumption that grows with n is therefore just the size of the array, which is n times the amount of memory required to store an integer in an array (let's call that sizeof(int)).
Conclusion: Time is O(n^2), Memory is n*sizeof(int) (+O(1), that is, up to an additional constant factor, which doesn't matter to us, and which we'll ignore from now on).
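For concreteness, Example 1 could be sketched in Python like this (the function name is made up):
def has_pair_sum_naive(S, k):
    # Check every pair of distinct indices: O(n^2) time, O(1) extra memory.
    n = len(S)
    for i in range(n):
        for j in range(i + 1, n):
            if S[i] + S[j] == k:
                return True
    return False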
Example 2
Let's consider the algorithm in MBo's answer.
Time complexity: much, much better than in Example 1. We start by creating a dictionary. This is done in a loop over n. Setting keys in a dictionary is a constant-time operation in proper conditions, so the time taken by each step of that first loop does not depend on n. Therefore, for now we've used O(n) in terms of time complexity. Now we only have one remaining loop over n. The time spent accessing elements of our dictionary is independent of n, so once again, the total complexity is O(n). Combining our two loops together, since they both grow like n up to a multiplicative factor, so does their sum (up to a different multiplicative factor). Total: O(n).
Memory: Basically the same as before, plus a dictionary of n elements. For the sake of simplicity, let's consider that these elements are integers (we could have used booleans), and forget about some of the aspects of dictionaries to only count the size used to store the keys and the values. There are n integer keys and n integer values to store, which uses 2*n*sizeof(int) in terms of memory. Add to that what we had before and we have a total of 3*n*sizeof(int).
Conclusion: Time is O(n), Memory is 3*n*sizeof(int). The algorithm is considerably faster when n grows, but uses three times more memory than example 1. In some weird scenarios where almost no memory is available (embedded systems maybe), this 3*n*sizeof(int) might simply be too much, and you might not be able to use this algorithm (admittedly, it's probably never going to be a real issue).
Example 3
Can we find a trade-off between Example 1 and Example 2?
One way to do that is to replicate the same kind of nested loop structure as in Example 1, but with some pre-processing to replace the inner loop with something faster. To do that, we sort the initial array, in place. Done with well-chosen algorithms, this has a time-complexity of O(n*log(n)) and negligible memory usage.
Once we have sorted our array, we write our outer loop (which is a regular loop over the whole array), and then inside that outer loop, use dichotomy (binary search) to look for the number we're missing to reach our target k. This dichotomy approach has a memory consumption of O(log(n)), and its time complexity is O(log(n)) as well.
Time complexity: The pre-processing sort is O(n*log(n)). Then in the main part of the algorithm, we have n calls to our O(log(n)) dichotomy search, which totals to O(n*log(n)). So, overall, O(n*log(n)).
Memory: Ignoring the constant parts, we have the memory for our array (n*sizeof(int)) plus the memory for our call stack in the dichotomy search (O(log(n))). Total: n*sizeof(int) + O(log(n)).
Conclusion: Time is O(n*log(n)), Memory is n*sizeof(int) + O(log(n)). Memory is almost as small as in Example 1. Time complexity is slightly more than in Example 2. In scenarios where the Example 2 cannot be used because we lack memory, the next best thing in terms of speed would realistically be Example 3, which is almost as fast as Example 2 and probably has enough room to run if the very slow Example 1 does.
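A Python sketch of Example 3, using the bisect module for the dichotomy (the function name is made up):
import bisect

def has_pair_sum_sorted(S, k):
    # Sort once (O(n log n)), then binary-search the complement of each
    # element among the items to its right, so no index is used twice.
    S = sorted(S)
    n = len(S)
    for i, x in enumerate(S):
        j = bisect.bisect_left(S, k - x, i + 1)
        if j < n and S[j] == k - x:
            return True
    return False

print(has_pair_sum_sorted([1, 2, 3, 4], 8))   # False
print(has_pair_sum_sorted([1, 2, 3, 4], 7))   # True (3 + 4)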
Overall conclusion
This answer was just to show that "optimal" is context-dependent in algorithmics. It's very unlikely that in this particular example, one would choose to implement Example 3. In general, you'd see either Example 1 if n is so small that one would choose whatever is simplest to design and fastest to code, or Example 2 if n is a bit larger and we want speed. But if you look at the wikipedia page I linked for sorting algorithms, you'll see that none of them is best at everything. They all have scenarios where they could be replaced with something better.

I don't understand why/how one of these methods is faster than the others

I wanted to test the difference in time between implementations of some simple code. I decided to count how many values out of a random sample of 10,000,000 numbers are greater than 0.5. The random sample is grabbed uniformly from the range [0.0, 1.0).
Here is my code:
from numpy.random import random_sample; import time;

n = 10000000;

t1 = time.clock();
t = 0;
z = random_sample(n);
for x in z:
    if x > 0.5: t += 1;
print t;

t2 = time.clock();
t = 0;
for _ in xrange(n):
    if random_sample() > 0.5: t += 1;
print t;

t3 = time.clock();
t = (random_sample(n) > 0.5).sum();
print t;

t4 = time.clock();
print t2-t1; print t3-t2; print t4-t3;
This is the output:
4999445
4999511
5001498
7.0348236652
1.75569394301
0.202538106332
I get that the first implementation sucks because creating a massive array and then counting it element-wise is a bad idea, so I thought that the second implementation would be the most efficient.
But how is the third implementation 10 times faster than the second method? Doesn't the third method also create a massive array in the form of random_sample(n) and then go through it checking values against 0.5?
How is this third method different from the first method and why is it ~35 times faster than the first method?
EDIT: @merlin2011 suggested that Method 3 probably doesn't create the full array in memory. So, to test that theory I tried the following:
z = random_sample(n);
t = (z > 0.5).sum();
print t;
which runs in a time of 0.197948451549 which is practically identical to Method 3. So, this is probably not a factor.
Method 1 generates a full list in memory before using it. This is slow because the memory has to be allocated and then accessed, probably missing the cache multiple times.
Method 2 uses a generator, which never creates the list in memory but instead generates each element on demand.
Method 3 is probably faster because sum() is implemented as a loop in C but I am not 100% sure. My guess is that this is faster for the same reason that Matlab vectorization is faster than for loops in Matlab.
Update: Separating out each of three steps, I observe that method 3 is still equally fast, so I have to agree with utdemir that each individual operator is executing instructions closer to machine code.
z = random_sample(n)
z2 = z > 0.5
t = z2.sum();
In each of the first two methods, you are invoking Python's standard functionality to do a loop, and this is much slower than a C-level loop that is baked into the implementation.
AFAIK
Function calls are heavy: on method two, you're calling random_sample() 10000000 times, but on the third method, you just call it once.
Numpy's > and .sum are optimized to their last bits in C, also most probably using SIMD instructions to avoid loops.
So,
On method 2, you are comparing and looping using Python; but on method 3, you're much closer to the processor and using optimized instructions to compare and sum.

Recursive generation + filtering. Better non-recursive?

I have the following need (in python):
generate all possible tuples of length 12 (could be more) containing either 0, 1 or 2 (basically, a ternary number with 12 digits)
filter these tuples according to specific criteria, culling those not good, and keeping the ones I need.
As I had to deal with small lengths until now, the functional approach was neat and simple: a recursive function generates all possible tuples, then I cull them with a filter function. Now that I have a larger set, the generation step is taking too much time, much longer than needed as most of the paths in the solution tree will be culled later on, so I could skip their creation.
I have two solutions to solve this:
derecurse the generation into a loop, and apply the filter criteria on each new 12-digits entity
integrate the filtering in the recursive algorithm, so to prevent it stepping into paths that are already doomed.
My preference goes to 1 (seems easier) but I would like to hear your opinion, in particular with an eye towards how a functional programming style deals with such cases.
How about
import itertools

results = []
for x in itertools.product(range(3), repeat=12):
    if myfilter(x):
        results.append(x)
where myfilter does the selection. Here, for example, only allowing results with 10 or more 1s:
def myfilter(x):  # example filter, only take lists with 10 or more 1s
    return x.count(1) >= 10
That is, my suggestion is your option 1. In some cases it may be slower because (depending on your criteria) you may generate many lists that you don't need, but it's much more general and very easy to code.
Edit: This approach also has a one-liner form, as suggested in the comments by hughdbrown:
results = [x for x in itertools.product(range(3), repeat=12) if myfilter(x)]
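For comparison, option 2 from the question (integrating the filter into the recursion so that doomed prefixes are never expanded) might look like this sketch; prefix_ok is a hypothetical predicate that must be meaningful on partial tuples:
def generate(length, prefix_ok, prefix=()):
    # Recursively extend the tuple, cutting any subtree whose prefix
    # can no longer lead to an acceptable result.
    if not prefix_ok(prefix):
        return
    if len(prefix) == length:
        yield prefix
        return
    for digit in (0, 1, 2):
        for t in generate(length, prefix_ok, prefix + (digit,)):
            yield t

# Example: the "10 or more 1s" criterion above, checked on prefixes; a prefix
# stays viable while the remaining positions could still reach ten 1s.
def viable(prefix, length=12, needed=10):
    return prefix.count(1) + (length - len(prefix)) >= needed

results = list(generate(12, viable))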
itertools has functionality for dealing with this. However, here is a (hardcoded) way of handling it with a generator:
T = (0,1,2)
GEN = ((a,b,c,d,e,f,g,h,i,j,k,l) for a in T for b in T for c in T for d in T for e in T for f in T for g in T for h in T for i in T for j in T for k in T for l in T)
for VAL in GEN:
    # Filter VAL
    print VAL
I'd implement an iterative binary adder or Hamming code and run it that way.
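One way to read that (my interpretation): iterate a base-3 counter, like an adder with carry, instead of recursing. A sketch, with ternary_tuples as a made-up name:
def ternary_tuples(length):
    # Start at all zeros and repeatedly add 1 with carry (an "adder"),
    # yielding each state until the counter wraps around.
    digits = [0] * length
    while True:
        yield tuple(digits)
        i = length - 1
        while i >= 0 and digits[i] == 2:
            digits[i] = 0
            i -= 1
        if i < 0:
            return
        digits[i] += 1

# e.g. sum(1 for t in ternary_tuples(12) if myfilter(t)) visits all 3**12 tuples.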
