The objective is to perform several calculations on a single iterator in parallel, using the builtin sum and map functions concurrently, and maybe using (something like) itertools instead of classic for loops, to analyze (LARGE) data that arrives via an iterator...
In one simple example case I want to calculate ilen, sum_x & sum_x_sq:
ilen, sum_x, sum_x_sq = iterlen(iter), sum(iter), sum(map(lambda x: x*x, iter))
But without converting the (large) iter to a list (as with iter=list(iter))
n.b. Do this using sum & map and without for loops, maybe using the itertools and/or threading modules?
import random

def example_large_data(n=100000000, mean=0, std_dev=1):
    for i in range(n):
        yield random.gauss(mean, std_dev)
-- edit --
Being VERY specific: I was taking a good look at itertools hoping that there was a dual function to map that could do it. For example: len_x, sum_x, sum_x_sq = itertools.iterfork(iter_x, iterlen, sum, sum_sq)
If I were to be very specific: I am looking for just one answer, Python source code for the "iterfork" procedure.
You can use itertools.tee to turn your single iterator into three iterators which you can pass to your three functions.
iter0, iter1, iter2 = itertools.tee(input_iter, 3)
ilen, sum_x, sum_x_sq = count(iter0), sum(iter1), sum(map(lambda x: x*x, iter2))
That will work, but the builtin function sum (and map in Python 2) is not implemented in a way that supports parallel iteration. The first function you call will consume its iterator completely, then the second one will consume the second iterator, then the third function will consume the third iterator. Since tee has to buffer every value that has been seen by one of its output iterators but not yet by all of the others, this is essentially the same as creating a list from the iterator and passing it to each function.
Now, if you use generator functions that consume only a single value from their input for each value they output, you might be able to make parallel iteration work using zip. In Python 3, map and zip are both lazy, generator-like iterators. The question is how to make sum into a generator.
I think you can get pretty much what you want by using itertools.accumulate (which was added in Python 3.2). It is a generator that yields a running sum of its input. Here's how you could make it work for your problem (I'm assuming your count function was supposed to be an iterator-friendly version of len):
iter0, iter1, iter2 = itertools.tee(input_iter, 3)
len_gen = itertools.accumulate(map(lambda x: 1, iter0))
sum_gen = itertools.accumulate(iter1)
sum_sq_gen = itertools.accumulate(map(lambda x: x*x, iter2))
parallel_gen = zip(len_gen, sum_gen, sum_sq_gen) # zip is a generator in Python 3
for ilen, sum_x, sum_x_sq in parallel_gen:
    pass  # the generators do all the work, so there's nothing for us to do here
# ilen, sum_x, sum_x_sq have the right values here!
If you're using Python 2, rather than 3, you'll have to write your own accumulate generator function (there's a pure Python implementation in the docs I linked above), and use itertools.imap and itertools.izip rather than the builtin map and zip functions.
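To wrap this up in the single "iterfork" procedure the question asks for, here is a rough sketch built on tee, accumulate and zip as described above. The name iterfork, the convention that each extra argument turns an iterator into an iterator of running results, and the reuse of example_large_data from the question (with import random added) are my own choices for illustration, not a standard API:

import itertools

def iterfork(iterable, *accumulators):
    # Split the input once; each accumulator gets its own tee'd branch.
    branches = itertools.tee(iterable, len(accumulators))
    # Each accumulator must turn an iterator into an iterator of running
    # results (e.g. built with itertools.accumulate), so that zip consumes
    # all branches in lock-step and tee never buffers more than a few items.
    running = zip(*(acc(branch) for acc, branch in zip(accumulators, branches)))
    result = None
    for result in running:   # the accumulate iterators do all the work
        pass
    return result            # final tuple of values, or None if the input was empty

def running_len(it):
    return itertools.accumulate(map(lambda x: 1, it))

def running_sum(it):
    return itertools.accumulate(it)

def running_sum_sq(it):
    return itertools.accumulate(map(lambda x: x * x, it))

ilen, sum_x, sum_x_sq = iterfork(example_large_data(1000000),
                                 running_len, running_sum, running_sum_sq)

Note that this still runs on a single thread: it interleaves the three reductions rather than running them concurrently, which is usually what you want anyway for CPU-bound work under the GIL.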
Related
In the code below, i1 is an iterator.
def sq(x):
    y = []
    for i in x:
        y.append(i**2)
    return y
l1 = range(5)
s1 = sq(l1)
i1 = iter(s1)
I can write a generator for the same squaring operation. In the code below, g1 is a generator.
def sqg(x):
    for i in x:
        yield i**2
g1 = sqg(l1)
I know that generators use less code and are simpler to read and write. I know that generators also run faster because they maintain their local states.
Are there any advantages to using i1 over g1?
When you call sq(l1), a list y is built up inside sq. By the time the input is exhausted, this consumes memory proportional to the size of x.
In the second case, when you call sqg(l1), sqg has no internal list in which to store the results. It yields each computed value directly, so the memory it consumes stays constant and independent of the size of x.
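A quick way to see the difference, as a sketch (reusing the sq and sqg definitions from the question):

import sys

big = range(100000)
print(sys.getsizeof(sq(big)))    # hundreds of kilobytes: the list grows with the input
print(sys.getsizeof(sqg(big)))   # a couple of hundred bytes, whatever the input size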
As for advantages of non-generator iterators over generators, I don't believe there are performance advantages, but there could be structural advantages. A generator (a type of iterator like you noted) is defined to be an iterator returned by calling a function with yield statements inside of it. That means that you cannot add any additional methods that can be called to the object representing the generator, because this special type of iterator is given to you implicitly.
On the other hand, an iterator has a looser definition: an object with a __next__ method and an __iter__ method returning self. You could make a class Squares that meets those criteria, and to get an instance of this iterator you would explicitly instantiate Squares. Because you have control over the attributes of the iterator returned to you, you could add instance methods exposing internal state of that iterator that isn't expressed through __next__, whereas with a generator you're locked into the generator object provided to you implicitly. Often a generator will do the job, but sometimes you need a non-generator iterator to get control beyond the functionality provided by __next__.
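For illustration, a minimal hand-written iterator in the spirit of that Squares example; the extra count_so_far method is hypothetical, just to show state that a plain generator could not expose:

class Squares:
    """Iterator over the squares of the values in `iterable`."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._produced = 0

    def __iter__(self):
        return self

    def __next__(self):
        value = next(self._it)   # raises StopIteration when the input is exhausted
        self._produced += 1
        return value * value

    def count_so_far(self):
        # Extra instance method a generator could not offer.
        return self._produced

squares = Squares(range(5))
print(next(squares), next(squares))   # 0 1
print(squares.count_so_far())         # 2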
In this specific case, I don't believe you need the explicit control given to you by using a non-generator iterator, so it would be better to use a generator.
There are advantages of creating a list s1 over a generator - it has a defined length, you can index and slice it, and you can iterate through it multiple times without re-creating it. Maybe you don't count these as advantages of the non-generator iterator, though.
Another difference is that an iterator based on a list involves doing all the work upfront and then caching the results, while the generator does the work one step at a time. If the processing task is resource-intensive, then the list approach will cause an initial pause while the list is generated, then run faster (because you only have to retrieve the results from memory; also consider that the results could be cached in a file, for example). The generator approach would have no initial pause but would then run slower as it generates each result.
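To make that concrete, a small sketch with an artificially expensive computation (the sleep only stands in for real work; the exact numbers are illustrative):

import time

def slow_square(x):
    time.sleep(0.01)   # stand-in for an expensive computation
    return x * x

start = time.perf_counter()
as_list = [slow_square(i) for i in range(50)]   # roughly half a second of work up front
print("list ready after", time.perf_counter() - start)

start = time.perf_counter()
as_gen = (slow_square(i) for i in range(50))    # returns immediately, no work done yet
print("generator ready after", time.perf_counter() - start)
first = next(as_gen)                            # each item now costs about 0.01 s
print("first item after", time.perf_counter() - start)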
When reading articles about the speed of Loop vs List comprehension vs Map, I usually find that list comprehension is faster than map when using lambda functions.
Here is a test I am running:
import timeit
def square(range):
    squares = []
    for number in range:
        squares.append(number*number)
    return squares
print(timeit.timeit('map(lambda a: a*a, range(100))', number = 100000))
print(timeit.timeit('[a*a for a in range(100)]', number = 100000))
print(timeit.timeit('square(range(100))', 'from __main__ import square', number = 100000))
and the results :
0.03845796199857432
0.5889980600004492
0.9229458660011005
so map is the clear winner although it uses a lambda function. Has there been a change in Python 3.7 causing this notable speed boost?
First of all, to have a fair comparison you have to convert the result of the map function to a list; map in Python 3.x returns an iterator object, not a list. Second, in the CPython implementation built-in functions are actually wrappers around C functions, which makes them faster than any Python code with the same functionality. However, when you pass a lambda to a built-in function you break that chain, and the call becomes approximately as slow as equivalent Python code.
Another important point is that a list comprehension is just syntactic sugar around a regular loop, and you can use it to avoid extra function calls such as repeatedly appending to a list.
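A fairer version of the benchmark forces map to produce a list as well, so all three variants do the same amount of work; this is a sketch, and the absolute numbers will of course vary by machine and Python version:

import timeit

def square(values):
    squares = []
    for number in values:
        squares.append(number * number)
    return squares

print(timeit.timeit('list(map(lambda a: a*a, range(100)))', number=100000))
print(timeit.timeit('[a*a for a in range(100)]', number=100000))
print(timeit.timeit('square(range(100))', 'from __main__ import square', number=100000))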
I have a list of objects and they have a method called process. In Python 2 one could do this
map(lambda x: x.process(), my_object_list)
In Python 3 this will not work because map doesn't call the function until the iterable is traversed. One could do this:
list(map(lambda x: x.process(), my_object_list))
But then you waste memory with a throwaway list (an issue if the list is big). I could also use a 2-line explicit loop. But this pattern is so common for me that I don't want to, or think I should need to, write a loop every time.
Is there a more idiomatic way to do this in Python 3?
Don't use map or a list comprehension where a simple for loop will do:
for x in list_of_objs:
    x.process()
It's not significantly longer than any function you might use to abstract it, but it is significantly clearer.
Of course, if process returns a useful value, then by all means, use a list comprehension.
results = [x.process() for x in list_of_objs]
or map:
results = list(map(lambda x: x.process(), list_of_objs))
There is a function available that makes map a little less clunky, especially if you want to reuse the caller:
from operator import methodcaller
processor = methodcaller('process')
results = list(map(processor, list_of_objs))
more_results = list(map(processor, another_list_of_objs))
If you are looking for a good name for a function to wrap the loop, Haskell has a nice convention: a function name ending with an underscore discards its "return value". (Actually, it discards the result of a monadic action, but I'd rather ignore that distinction for the purposes of this answer.)
def map_(f, *args):
    for f_args in zip(*args):
        f(*f_args)
# Compare:
map(f, [1,2,3]) # -- return value of [f(1), f(2), f(3)] is ignored
map_(f, [1,2,3]) # list of return values is never built
Since you're looking for a Pythonic solution, why would you even bother trying to adapt map(lambda x: x.process(), my_object_list) for Python 3?
Isn't a simple for loop enough?
for x in my_object_list:
    x.process()
I mean, this is concise, readable, and avoids creating an unnecessary list if you don't need the return values.
The related question How do I merge two python iterators? works well for two independent iterators. However, I haven't been able to find or think of the tools necessary for merging two iterators where one is recursive and takes the other as an input. I have iterator stuff that is a simple list. Then I have iterator theta that takes a function func and yields x, func(x), func(func(x)), where one of the inputs to func is an element of stuff. I've solved this with mutable state as follows:
theta = some_initial_theta
for thing in stuff:
    theta = update_theta(theta, thing)
return theta
A concrete example in this format:
def update_theta(theta, thing):
    return thing * 2 + theta

stuff = [100, 200, 300, 400]

def my_iteration():
    theta = 0
    for thing in stuff:
        theta = update_theta(theta, thing)
    print(theta)
# This prints 2000
I'm sure there's an elegant way of doing this without the mutable state and the for loop. A simple zip doesn't do it for me because the theta iterator uses its previous element as an input to the next element.
One elegant way of expressing theta is using the iterate function available in the more_itertools package:
iterate(lambda theta: update_theta(theta, thing), some_initial_theta)
However, the problem with this is that thing will be fixed throughout the iteration. It would be possible to deal with this by passing in the entire list stuff and then return the remainder of it from the update_theta method:
iterate(lambda theta: update_theta(theta[0], theta[1]), (some_initial_theta, stuff))
However, I'd really rather not modify the update_theta method to take an entire list it's not interested in and deal with the mechanics of returning the tail of that list. While it's programmatically not difficult, it's poor separation of concerns. update_theta shouldn't know anything about or care about the entire list stuff.
As Peter Wood suggests in the comments, this is exactly what the built-in function reduce does:
result = reduce(update_theta, stuff, some_initial_theta)
In Python 3, reduce has been moved to functools.reduce, so you'd need to import that:
from functools import reduce
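Applied to the concrete example from the question, a quick sketch:

from functools import reduce

def update_theta(theta, thing):
    return thing * 2 + theta

stuff = [100, 200, 300, 400]
result = reduce(update_theta, stuff, 0)   # same folding as the explicit loop
print(result)                             # 2000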
If you want an iterator of all the intermediate values, Python 3 provides itertools.accumulate. Before Python 3.8 there is no argument to specify an initial value (3.8 added an initial keyword), so you'd need to put the initial value in the iterator:
from itertools import accumulate, chain
result_iterator = accumulate(chain([some_initial_theta], stuff), update_theta)
Python 2 doesn't have itertools.accumulate, but you could copy the equivalent code from the Python 3 documentation. There's no easy way to formulate it in terms of the Python 2 standard tools, which is why people wanted it added to Python 3 in the first place.
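With the same example data, the accumulate version yields every intermediate theta as well as the final one (a sketch reusing update_theta and stuff from above):

from itertools import accumulate, chain

intermediates = list(accumulate(chain([0], stuff), update_theta))
print(intermediates)        # [0, 200, 600, 1200, 2000]
print(intermediates[-1])    # 2000, the same result reduce gives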
Is it possible to perform multiple loops simultaneously in Python?
Like (syntax error, of course):
for a, b in list_of_a, list_of_b:
    # do some thing
By simultaneously, I am not meaning the thread or process sense.
I mean, they share the same index or cursor during the iteration.
What I can think of for achieving that is:
Use an int variable to act as a shared cursor
Put them in a list of tuples and iterate over the tuple-list. But creating the list is laborious.
I am just wondering if there are some built-in functions or simpler syntax to achieve that.
for a, b in zip(list_of_a, list_of_b):
    # Do some thing
If you're using Python 2.x, are worried about performance, and/or using iterators instead of lists, consider itertools.izip instead of zip.
In Python 3.x, zip replaces itertools.izip; use list(zip(..)) to get the old (2.x) behavior of zip returning a list.
import itertools

for a, b in itertools.izip(list_a, list_b):
    # ...