Python: Functionally Merging Two Iterators Where One is Recursive

The related question How do I merge two python iterators? works well for two independent iterators. However, I haven't been able to find or think of the tools necessary for merging two iterators where one is recursive and takes the other as an input. I have an iterator stuff that is a simple list. Then I have an iterator theta that takes a function func and yields x, func(x), func(func(x)), ..., where one of the inputs to func is an element of stuff. I've solved this with mutable state as follows:
theta = some_initial_theta
for thing in stuff:
    theta = update_theta(theta, thing)
return theta
A concrete example in this format:
def update_theta(theta, thing):
    return thing * 2 + theta

stuff = [100, 200, 300, 400]

def my_iteration():
    theta = 0
    for thing in stuff:
        theta = update_theta(theta, thing)
    print(theta)
# This prints 2000
I'm sure there's an elegant way of doing this without the mutable state and the for loop. A simple zip doesn't do it for me because the theta iterator uses its previous element as an input to the next element.
One elegant way of expressing theta is using the iterate function available in the more_itertools package:
iterate(lambda theta: update_theta(theta, thing), some_initial_theta)
However, the problem with this is that thing will be fixed throughout the iteration. It would be possible to deal with this by passing in the entire list stuff and then returning the remainder of it from the update_theta method:
iterate(lambda theta: update_theta(theta[0], theta[1]), (some_initial_theta, stuff))
However, I'd really rather not modify the update_theta method to take an entire list it's not interested in and deal with the mechanics of returning the tail of that list. While it's programmatically not difficult, it's poor separation of concerns. update_theta shouldn't know anything about or care about the entire list stuff.

As Peter Wood suggests in the comments, this is exactly what the built-in function reduce does:
result = reduce(update_theta, stuff, some_initial_theta)
In Python 3, reduce has been moved to functools.reduce, so you'd need to import that:
from functools import reduce
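For the concrete numbers in the question, this reproduces the 2000 computed by the explicit loop (a minimal check, reusing the question's update_theta and stuff):
from functools import reduce

def update_theta(theta, thing):
    return thing * 2 + theta

stuff = [100, 200, 300, 400]

print(reduce(update_theta, stuff, 0))  # 2000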
If you want an iterator of all the intermediate values, Python 3 provides itertools.accumulate. Before Python 3.8 (which added an initial keyword argument) there was no way to specify an initial value directly, so you'd need to put the initial value in the iterator:
from itertools import accumulate, chain
result_iterator = accumulate(chain([some_initial_theta], stuff), update_theta)
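With the question's numbers, the intermediate values come out as the starting value followed by each successive theta (a minimal check):
from itertools import accumulate, chain

def update_theta(theta, thing):
    return thing * 2 + theta

stuff = [100, 200, 300, 400]

print(list(accumulate(chain([0], stuff), update_theta)))
# [0, 200, 600, 1200, 2000]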
Python 2 doesn't have itertools.accumulate, but you could copy the equivalent code from the Python 3 documentation. There's no easy way to formulate it in terms of the Python 2 standard tools, which is why people wanted it added to Python 3 in the first place.
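For reference, the pure-Python equivalent from the itertools documentation is short enough to inline; a sketch of the two-argument form used here:
import operator

def accumulate(iterable, func=operator.add):
    # Pure-Python stand-in for itertools.accumulate, usable on Python 2.
    it = iter(iterable)
    try:
        total = next(it)
    except StopIteration:
        return
    yield total
    for element in it:
        total = func(total, element)
        yield total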

Related

out-of-core/external-memory combinatorics in python

I am iterating the search space of valid Python3 ASTs. With max recursion depth = 3, my laptop runs out of memory. My implementation makes heavy use of generators, specifically 'yield' and itertools.product().
Ideally, I'd replace product() and the max recursion depth with some sort of iterative deepening, but first things first:
Are there any libraries or useful SO posts for out-of-core/external-memory combinatorics?
If not... I am considering the feasibility of using either dask or joblib's delayed()... or perhaps wendelin-core's ZBigArray, though I don't like the looks of its interface:
root = dbopen('test.fs')
root['A'] = A = ZBigArray((10,), np.int)
transaction.commit()
Based on this example, I think that my solution would involve an annotation/wrapper function that eagerly converts the generators to ZBigArrays, replacing root['A'] with something like root[get_key(function_name, *function_args)]. It's not pretty, since my generators are not entirely pure--the output is shuffled. In my current version this shouldn't be a big deal, but the previous and next versions involve using various NNs and RL rather than mere shuffling.
First things first: the reason you're getting the out-of-memory error is that itertools.product() caches intermediate values. It has no idea whether the function that gave you your generator is idempotent, and even if it did, it wouldn't be able to infer how to call it again given just the generator. This means itertools.product must cache the values of each iterable it's passed.
The solution here is to bite the small performance bullet and either write explicit for loops, or write your own Cartesian product function that takes the functions which produce each generator. For instance:
def product(*funcs, repeat=None):
    if not funcs:
        yield ()
        return
    if repeat is not None:
        funcs *= repeat
    func, *rest = funcs
    for val in func():
        for res in product(*rest):
            yield (val,) + res
from functools import partial
values = product(partial(gen1, arg1, arg2), partial(gen2, arg1))
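As a small illustration of the calling convention, using the product function above with two toy generator functions (the names here are just for the example):
def gen_digits():
    yield from (0, 1)

def gen_letters():
    yield from "ab"

print(list(product(gen_digits, gen_letters)))
# [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')]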
The bonus from rolling your own here is that you can also change how it traverses the A by B by C ... dimensional search space, so that you could do, say, a breadth-first search instead of an iteratively deepening DFS. Or you could pick some space-filling curve, such as the Hilbert curve, which would iterate over all indices/depths of each dimension in your product() in a locality-preserving fashion.
Apart from that, I have one more thing to point out: you can also implement BFS lazily (using generators) to avoid building a queue that could bloat memory usage as well. See this SO answer, copied below for convenience:
def breadth_first(self):
    yield self
    for c in self.breadth_first():
        if not c.children:
            return  # stop the recursion as soon as we hit a leaf
        yield from c.children
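For context, here is a minimal sketch of a node class this method could live on (the Node class and its value/children fields are assumptions for illustration, not from the question):
class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

    def breadth_first(self):
        yield self
        for c in self.breadth_first():
            if not c.children:
                return  # stop the recursion as soon as we hit a leaf
            yield from c.children

tree = Node('root', [Node('a', [Node('a1')]), Node('b')])
print([n.value for n in tree.breadth_first()])  # ['root', 'a', 'b', 'a1']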
Overall, you will take a performance hit from using semi-coroutines, with zero caching, all in Python-land (in comparison to the baked-in and heavily optimized C of CPython). However, it should still be doable: algorithmic optimizations (avoiding generating semantically nonsensical ASTs, prioritizing ASTs that suit your goal, etc.) will have a larger impact than the constant-factor performance hit.

Idiomatic way to call method on all objects in a list of objects Python 3

I have a list of objects and they have a method called process. In Python 2 one could do this
map(lambda x: x.process(), my_object_list)
In Python 3 this will not work because map doesn't call the function until the iterable is traversed. One could do this:
list(map(lambda x: x.process(), my_object_list))
But then you waste memory with a throwaway list (an issue if the list is big). I could also use a 2-line explicit loop. But this pattern is so common for me that I don't want to, or think I should need to, write a loop every time.
Is there a more idiomatic way to do this in Python 3?
Don't use map or a list comprehension where a simple for loop will do:
for x in list_of_objs:
    x.process()
It's not significantly longer than any function you might use to abstract it, but it is significantly clearer.
Of course, if process returns a useful value, then by all means, use a list comprehension.
results = [x.process() for x in list_of_objs]
or map:
results = list(map(lambda x: x.process(), list_of_objs))
There is a helper, operator.methodcaller, that makes map a little less clunky, especially if you want to reuse the caller:
from operator import methodcaller
processor = methodcaller('process')
results = list(map(processor, list_of_objs))
more_results = list(map(processor, another_list_of_objs))
If you are looking for a good name for a function to wrap the loop, Haskell has a nice convention: a function name ending with an underscore discards its "return value". (Actually, it discards the result of a monadic action, but I'd rather ignore that distinction for the purposes of this answer.)
def map_(f, *args):
    for f_args in zip(*args):
        f(*f_args)

# Compare:
map(f, [1, 2, 3])   # the return value [f(1), f(2), f(3)] is ignored
map_(f, [1, 2, 3])  # a list of return values is never built
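For the method-calling case this thread started with, map_ combines naturally with methodcaller from the earlier answer; a small sketch (the Job class is made up for illustration):
from operator import methodcaller

class Job:
    def process(self):
        print('processing')

list_of_objs = [Job(), Job()]
map_(methodcaller('process'), list_of_objs)  # prints "processing" twice; no result list is built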
Since you're looking for a Pythonic solution, why would you even bother trying to adapt map(lambda x: x.process(), my_object_list) for Python 3?
Isn't a simple for loop enough?
for x in my_object_list:
    x.process()
I mean, this is concise, readable and avoids creating an unnecessary list if you don't need the return values.

iterating over a single list in parallel in python

The objective is to do calculations on a single iter in parallel using builtin sum & map functions concurrently. Maybe using (something like) itertools instead of classic for loops to analyze (LARGE) data that arrives via an iterator...
In one simple example case I want to calculate ilen, sum_x & sum_x_sq:
ilen, sum_x, sum_x_sq = iterlen(iter), sum(iter), sum(map(lambda x: x * x, iter))
But without converting the (large) iter to a list (as with iter=list(iter))
n.b. Do this using sum & map and without for loops, maybe using the itertools and/or threading modules?
import random

def example_large_data(n=100000000, mean=0, std_dev=1):
    for i in range(n):
        yield random.gauss(mean, std_dev)
-- edit --
Being VERY specific: I was taking a good look at itertools hoping that there was a dual function like map that could do it. For example: len_x, sum_x, sum_x_sq = itertools.iterfork(iter_x, iterlen, sum, sum_sq)
If I was to be very very specific: I am looking for just one answer, python source code for the "iterfork" procedure.
You can use itertools.tee to turn your single iterator into three iterators which you can pass to your three functions.
iter0, iter1, iter2 = itertools.tee(input_iter, 3)
ilen, sum_x, sum_x_sq = count(iter0), sum(iter1), sum(map(lambda x: x * x, iter2))
That will work, but the built-in function sum (and map in Python 2) is not implemented in a way that supports parallel iteration. The first function you call will consume its iterator completely, then the second one will consume the second iterator, then the third function will consume the third iterator. Since tee has to store every value seen by one of its output iterators but not yet by all of the others, this is essentially the same as creating a list from the iterator and passing it to each function.
Now, if you use generator functions that consume only a single value from their input for each value they output, you might be able to make parallel iteration work using zip. In Python 3, map and zip are both generators. The question is how to make sum into a generator.
I think you can get pretty much what you want by using itertools.accumulate (which was added in Python 3.2). It is a generator that yields a running sum of its input. Here's how you could make it work for your problem (I'm assuming your count function was supposed to be an iterator-friendly version of len):
iter0, iter1, iter2 = itertools.tee(input_iter, 3)
len_gen = itertools.accumulate(map(lambda x: 1, iter0))
sum_gen = itertools.accumulate(iter1)
sum_sq_gen = itertools.accumulate(map(lambda x: x * x, iter2))
parallel_gen = zip(len_gen, sum_gen, sum_sq_gen)  # zip is a generator in Python 3

for ilen, sum_x, sum_x_sq in parallel_gen:
    pass  # the generators do all the work, so there's nothing for us to do here

# ilen, sum_x, sum_x_sq have the right values here!
If you're using Python 2, rather than 3, you'll have to write your own accumulate generator function (there's a pure Python implementation in the docs I linked above), and use itertools.imap and itertools.izip rather than the builtin map and zip functions.
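As a quick sanity check of the Python 3 version above on a small, made-up input (not the large generator from the question):
import itertools

input_iter = iter([1.0, 2.0, 3.0])
iter0, iter1, iter2 = itertools.tee(input_iter, 3)
len_gen = itertools.accumulate(map(lambda x: 1, iter0))
sum_gen = itertools.accumulate(iter1)
sum_sq_gen = itertools.accumulate(map(lambda x: x * x, iter2))
for ilen, sum_x, sum_x_sq in zip(len_gen, sum_gen, sum_sq_gen):
    pass
print(ilen, sum_x, sum_x_sq)  # 3 6.0 14.0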

Python standard function for dual of map

Does the Python language have a built-in function for an analog of map that sends an argument to a sequence of functions, rather than a function to a sequence of arguments?
Plain map would have "type" (thinking like Haskell) (a -> b) -> [a] -> [b]; is there anything with the corresponding type a -> [(a -> b)] -> [b]?
I could implement this in a number of ways. Here's one using a lambda:
def rev_map(x, seq):
    evaluate_yourself_at_x = lambda f: f(x)
    return map(evaluate_yourself_at_x, seq)

rev_map([1, 2], [sum, len, type])
which gives [3, 2, <type 'list'>] in Python 2 (in Python 3 you'd wrap the call in list() to see the values).
I'm just curious if this concept of "induce a function to evaluate itself at me" has a built-in or commonly used form.
One motivation for me is thinking about dual spaces in functional analysis, where a space of elements which used to be conceived of as arguments passed to functions is suddenly conceived of as a space of elements which are functions whose operation is to induce another function to be evaluated at them.
You could think of a function like sin as being an infinite map from numbers to numbers, you give sin a number, sin gives you some associated number back, like sin(3) or something.
But then you could also think of the number 3 as an infinite map from functions to numbers, you give 3 a function f and 3 gives you some associated number, namely f(3).
I'm finding cases where I'd like some efficient syntax to suddenly view "arguments" or "elements" as "function-call-inducers" but most things, e.g. my lambda approach above, seem clunky.
Another thought I had was to write wrapper classes for the "elements" where this occurs. Something like:
from __future__ import print_function

class MyFloat(float):
    def __call__(self, f):
        return f(self)

m = MyFloat(3)
n = MyFloat(2)

MyFloat(m + n)(type)
MyFloat(m + n)(print)
which will print __main__.MyFloat and 5.0.
But this requires a lot of overhead to redefine data model operators and so on, and clearly it's not a good idea to push around your own version of something as basic as float, which will be ubiquitous in most programs. It's also easy to get wrong, as in this example:
# Will result in a recursion error.
MyFloat(3)(MyFloat(4))
There is no built-in function for that, simply because it's not a commonly used concept. Plus, Python is not designed to solve mathematical problems.
As for the implementation, here's the shortest one you can get, IMHO:
rev_map = lambda x, seq: [f(x) for f in seq]
Note that the list comprehension is so short and easy that wrapping it with a function seems to be unnecessary in the first place.
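Reusing the call from the question with this one-liner (Python 3 output shown):
rev_map = lambda x, seq: [f(x) for f in seq]

print(rev_map([1, 2], [sum, len, type]))
# [3, 2, <class 'list'>]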

Perform operations on elements of a NumPy array

Is there a faster/smarter way to perform operations on every element of a numpy array? What I specifically have is an array of datetime objects, e.g.:
hh = np.array( [ dt.date(2000, 1, 1), dt.date(2001, 1, 1) ] )
To get a list of years from that, I currently do:
years = np.array( [ x.year for x in hh ] )
Is there a smarter way to do this? I'm thinking something like
hh.year
which obviously doesn't work.
I have a script in which I constantly need different variations of a (much longer) array (year, month, hours...). Of course I could always just define a separate array for everything, but it feels like there should be a more elegant solution.
If you evaluate a Python expression for each element, it doesn't matter much whether the iteration itself is done in C++ or in Python. What carries the weight is the Python-level complexity of the evaluated (in-loop) expression. This means: if your in-loop expression takes 1 microsecond (a very simple expression), that cost will outweigh the difference between a Python iteration and a C++ iteration (there is a "marshalling" step between C++ and PyObjects, and that applies to Python functions as well).
For that reason, vectorize is, under the hood, done in Python: what gets called inside is Python code. The idea behind vectorize is not performance but code readability and ease of iteration: vectorize performs introspection (of the function's parameters) and serves well for N-dimensional iteration (i.e. a lambda x, y: x + y automagically serves to iterate in two dimensions).
So: no, there's no "fast" way to iterate Python code. The final speed that matters is the speed of your inner Python code.
Edit: your desired hh.year looks like the equivalent of hh*.year in Groovy, but even there, under the hood, it is the same as an in-code iteration. Comprehensions are the fastest (and equivalent) way in Python. The real pity is being forced to write:
years = np.array( [ x.year for x in hh ] )
(which forces you to create another, probably huge, list) instead of letting you use any type of iterator:
years = np.array( x.year for x in hh )
Edit (suggestion by @Jaime): you can't construct an array from an iterator that way. For that, you must use:
np.fromiter((x.year for x in hh), dtype=int, count=len(hh))
which lets you save the time and memory of building an intermediate list. This exact approach works for any sequence, avoiding the inner-list creation (which is your case here), but since count needs the length up front it does not carry over to other kinds of generators you might need in future cases.
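Putting that together with the hh array from the question, a small self-contained check:
import datetime as dt
import numpy as np

hh = np.array([dt.date(2000, 1, 1), dt.date(2001, 1, 1)])
years = np.fromiter((x.year for x in hh), dtype=int, count=len(hh))
print(years)  # [2000 2001]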
You can use numpy.vectorize.
In some quick benchmarking, performance is pretty similar (vectorize is slightly slower than a list comprehension), and in my opinion numpy.vectorize(lambda j: j.year)(hh) (or something similar) doesn't look particularly elegant either.
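For completeness, a minimal sketch of the vectorize variant (assuming hh as defined in the question):
import datetime as dt
import numpy as np

hh = np.array([dt.date(2000, 1, 1), dt.date(2001, 1, 1)])
get_year = np.vectorize(lambda d: d.year)
print(get_year(hh))  # [2000 2001]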
