Applying different functions to a generator depending on its length - python

Problem
I have an iterator spam to which I want to apply the function foo if it generates few items, and bar otherwise. In other words, I wish to translate the following code for iterables to generators:
if len(spam) <= max_size_for_foo:
    foo(spam)
else:
    bar(spam)
max_size_for_foo is comparatively small, so there is no problem if a list or other iterable of this length is created, but if spam is long, it must not be converted to a list (otherwise, memory problems ensue).
Dissatisfying solution
The best solution I could come up with so far is the following:
from itertools import chain

first_items = []
try:
    for i in range(max_size_for_foo + 1):
        first_items.append(next(my_generator))
except StopIteration:
    foo(first_items)
else:
    my_generator = chain(first_items, my_generator)
    bar(my_generator)
However, extracting a temporary list and chaining back into the generator feels rather dirty and inelegant to me.
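The same buffer-and-chain idea can be written a bit more compactly with itertools.islice, avoiding the manual try/except, though it is still the same approach (a sketch using the names from above):

from itertools import chain, islice

first_items = list(islice(my_generator, max_size_for_foo + 1))
if len(first_items) <= max_size_for_foo:
    foo(first_items)
else:
    bar(chain(first_items, my_generator))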
Question
Is there a more elegant or Pythonesque way to do this?

The easiest way is probably to define your generator in a function so that it can be reused:
def spam_func():
    return (i for i in [1, 2, 3])

spam_length = sum(1 for _ in spam_func())
if spam_length <= max_size_for_foo:
    foo(spam_func())
else:
    bar(spam_func())

There is no such thing as generator length, because a generator can easily be infinite:
def square():
    a = 0
    while True:
        yield a ** 2
        a += 1
One solution is sum(1 for _ in gen).
Another is to wrap your generator in a class, something like this:
class wrapper:
    def __init__(self, items):
        self.items = items
    def __len__(self):
        return len(self.items)
    def generate(self):
        yield from self.items
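Usage might look like this; note the wrapped items must themselves be a sized collection (a list here), not a bare generator:

w = wrapper([1, 2, 3])
print(len(w))              # 3
print(list(w.generate()))  # [1, 2, 3]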

Related

Get length of a (non-infinite) iterator inside its loop using Python 2.7

I'm working with an iterator generated using itertools.imap, and I'm wondering whether there is a way to access the iterator's length inside the for loop I use to iterate over its elements.
What I can say for sure is that the iterator doesn't generate an infinite amount of data.
Also, because the data I'm looping over comes from a database query, I can get its length from there, but the function I'm writing has to return an iterator.
I thought of some options:
def iterator_function(some, arguments, that, I, need):
    query_result = query()
    return_iterator = itertools.imap(
        mapping_function,
        query_result
    )
    return return_iterator
Because I cannot change the returned iterator, I thought of something (really ugly) like:
query_result = query()
query_size = len(query_result)
return_iterator = itertools.imap(
    lambda item: (mapping_function(item), query_size),
    query_result
)
return return_iterator
But I don't really like this option, and I was wondering whether there is a way, in Python, to get the iterator's size from inside the loop over it, something like:
for item in iterator():
    print item.iterator_length()
    # do other stuff
Or even something like:
for item in iterator():
    print iterator.length()  # or iterator().length()???
Thanks!
I don't know if my idea is correct, but how about the class-based generator pattern, adding sequence behaviour: if your class represents something that has a length, don't define a GetLength method; define the __len__ method and use len(instance). Something like this:
class firstn(object):
    def __init__(self, n):
        self.n = n
        self.num = 0
    def __iter__(self):
        return self
    # Python 3 compatibility
    def __next__(self):
        return self.next()
    # V------- Something like this
    def __len__(self):
        return self.n  # the known target length
    def next(self):
        if self.num < self.n:
            cur, self.num = self.num, self.num + 1
            return cur
        else:
            raise StopIteration()
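With the length wired to the known target n, len() and iteration both work:

f = firstn(5)
print(len(f))   # 5
print(list(f))  # [0, 1, 2, 3, 4]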
Also, because the data I'm looping over comes from a database query, I can get its length from there, but the function I'm writing has to return an iterator.
Assuming you have a decent database, do the count there. I doubt any solution in Python will be faster or cleaner.
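A minimal sketch of that suggestion (Python 2, matching the question's itertools.imap; query() and mapping_function are the question's hypothetical names, and the result set is assumed to be sized, otherwise issue a separate SELECT COUNT(*)):

import itertools

def iterator_function(some, arguments):
    query_result = query()          # assumed: a sized result set
    query_size = len(query_result)  # or run SELECT COUNT(*) on the DB side
    return query_size, itertools.imap(mapping_function, query_result)

size, items = iterator_function('some', 'arguments')
for item in items:
    print item, size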

How can I have multiple iterators over a single python iterable at the same time?

I would like to compare all elements in my iterable object combinatorially with each other. The following reproducible example just mimics the functionality of a plain list, but demonstrates my problem. In this example with a list of ["A","B","C","D"], I would like to get the following 16 lines of output, every combination of each item with each other. A list of 100 items should generate 100*100 = 10,000 lines.
A A True
A B False
A C False
... 10 more lines ...
D B False
D C False
D D True
The following code seemed like it should do the job.
class C():
    def __init__(self):
        self.stuff = ["A", "B", "C", "D"]
    def __iter__(self):
        self.idx = 0
        return self
    def __next__(self):
        self.idx += 1
        if self.idx > len(self.stuff):
            raise StopIteration
        else:
            return self.stuff[self.idx - 1]

thing = C()
for x in thing:
    for y in thing:
        print(x, y, x==y)
But after finishing the y-loop, the x-loop seems done too, even though it has only used the first item in the iterable.
A A True
A B False
A C False
A D False
After much searching, I eventually tried the following code, hoping that itertools.tee would allow me two independent iterators over the same data:
import itertools

thing = C()
thing_one, thing_two = itertools.tee(thing)
for x in thing_one:
    for y in thing_two:
        print(x, y, x==y)
But I got the same output as before.
The real-world object this represents is a model of a directory and file structure with varying numbers of files and subdirectories, at varying depths into the tree. It has nested links to thousands of members and iterates correctly over them once, just like this example. But it also does expensive processing within its many internal objects on-the-fly as needed for comparisons, which would end up doubling the workload if I had to make a complete copy of it prior to iterating. I would really like to use multiple iterators, pointing into a single object with all the data, if possible.
Edit on answers: The critical flaw in the question code, pointed out in all answers, is the single internal self.idx variable being unable to handle multiple callers independently. The accepted answer is the best for my real class (oversimplified in this reproducible example), another answer presents a simple, elegant solution for simpler data structures like the list presented here.
You can't really make a container class serve as its own iterator. The container shouldn't know about the state of the iterator, and the iterator doesn't need to know the contents of the container; it just needs to know which object is the corresponding container and "where" it is. If you mix iterator and container, different iterators will share state with each other (in your case, self.idx), which will not give the correct results (they read and modify the same variable).
That's the reason why all built-in types have a separate iterator class (and some even have a reverse-iterator class):
>>> l = [1, 2, 3]
>>> iter(l)
<list_iterator at 0x15e360c86d8>
>>> reversed(l)
<list_reverseiterator at 0x15e360a5940>
>>> t = (1, 2, 3)
>>> iter(t)
<tuple_iterator at 0x15e363fb320>
>>> s = '123'
>>> iter(s)
<str_iterator at 0x15e363fb438>
So, basically you could just return iter(self.stuff) in __iter__ and drop __next__ altogether, because the list_iterator knows how to iterate over the list:
class C:
    def __init__(self):
        self.stuff = ["A", "B", "C", "D"]
    def __iter__(self):
        return iter(self.stuff)

thing = C()
for x in thing:
    for y in thing:
        print(x, y, x==y)
prints 16 lines, as expected.
If your goal is to make your own iterator class, you need two classes (or 3 if you want to implement the reversed-iterator yourself).
class C:
    def __init__(self):
        self.stuff = ["A", "B", "C", "D"]
    def __iter__(self):
        return C_iterator(self)
    def __reversed__(self):
        return C_reversed_iterator(self)

class C_iterator:
    def __init__(self, parent):
        self.idx = 0
        self.parent = parent
    def __iter__(self):
        return self
    def __next__(self):
        self.idx += 1
        if self.idx > len(self.parent.stuff):
            raise StopIteration
        else:
            return self.parent.stuff[self.idx - 1]

thing = C()
for x in thing:
    for y in thing:
        print(x, y, x==y)
works as well.
For completeness, here's one possible implementation of the reversed-iterator:
class C_reversed_iterator:
    def __init__(self, parent):
        self.parent = parent
        self.idx = len(parent.stuff) + 1
    def __iter__(self):
        return self
    def __next__(self):
        self.idx -= 1
        if self.idx <= 0:
            raise StopIteration
        else:
            return self.parent.stuff[self.idx - 1]

thing = C()
for x in reversed(thing):
    for y in reversed(thing):
        print(x, y, x==y)
Instead of defining your own iterators you could use generators. One way was already shown in the other answer:
class C:
    def __init__(self):
        self.stuff = ["A", "B", "C", "D"]
    def __iter__(self):
        yield from self.stuff
    def __reversed__(self):
        yield from self.stuff[::-1]
or explicitly delegate to a generator function (that's actually equivalent to the above, but it may make it clearer that a new object is produced):
def C_iterator(obj):
    for item in obj.stuff:
        yield item

def C_reverse_iterator(obj):
    for item in obj.stuff[::-1]:
        yield item

class C:
    def __init__(self):
        self.stuff = ["A", "B", "C", "D"]
    def __iter__(self):
        return C_iterator(self)
    def __reversed__(self):
        return C_reverse_iterator(self)
Note: you don't have to implement the __reversed__ iterator; it was just meant as an additional "feature" of the answer.
Your __iter__ is completely broken. Instead of actually making a fresh iterator on every call, it just resets some state on self and returns self. That means you can't actually have more than one iterator at a time over your object, and any call to __iter__ while another loop over the object is active will interfere with the existing loop.
You need to actually make a new object. The simplest way to do that is to use yield syntax to write a generator function. The generator function will automatically return a new iterator object every time:
class C(object):
    def __init__(self):
        self.stuff = ['A', 'B', 'C', 'D']
    def __iter__(self):
        for thing in self.stuff:
            yield thing
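A quick check that each call to iter() now produces an independent iterator:

thing = C()
it1 = iter(thing)
it2 = iter(thing)
print(next(it1), next(it1), next(it2))  # A B A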

python generator of generators?

I wrote a class that reads a txt file. The file is composed of blocks of non-empty lines (let's call them "sections"), separated by an empty line:
line1.1
line1.2
line1.3

line2.1
line2.2
My first implementation was to read the whole file and return a list of lists, that is a list of sections, where each section is a list of lines.
This was obviously terrible memory-wise.
So I re-implemented it as a generator of lists; that is, at every cycle my class reads a whole section into memory as a list and yields it.
This is better, but it's still problematic in case of large sections. So I wonder if I can reimplement it as a generator of generators? The problem is that this class is very generic, and it should be able to satisfy both of these use cases:
read a very big file, containing very big sections, and cycle through it only once. A generator of generators is perfect for this.
read a smallish file into memory to be cycled over multiple times. A generator of lists works fine, because the user can just invoke
list(MyClass(file_handle))
However, a generator of generators would NOT work in case 2, as the inner objects would not be transformed to lists.
Is there anything more elegant than implementing an explicit to_list() method, that would transform the generator of generators into a list of lists?
Python 2:
map(list, generator_of_generators)
Python 3:
list(map(list, generator_of_generators))
or for both:
[list(gen) for gen in generator_of_generators]
Since the generated objects are generator functions, not mere generators, you'd want to do
[list(gen()) for gen in generator_of_generator_functions]
If that doesn't work I have no idea what you're asking. Also, why would it return a generator function and not a generator itself?
Since in the comments you said you wanted to avoid list(generator_of_generator_functions) from crashing mysteriously, this depends on what you really want.
It is not possible to override the behaviour of list in this way: either you store the sub-generator's elements or you don't.
If you really do get a crash, I recommend exhausting the sub-generator with the main generator loop every time the main generator iterates. This is standard practice and exactly what itertools.groupby, a stdlib generator-of-generators, does.
e.g.:

def metagen():
    def innergen():
        yield 1
        yield 2
        yield 3
    for i in range(3):
        r = innergen()
        yield r
        for _ in r: pass
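For comparison, here is how the original file-of-sections problem might look with itertools.groupby (a sketch; it assumes sections are separated by blank lines and that file_handle iterates over lines):

from itertools import groupby

def sections(file_handle):
    # Group consecutive lines by blank vs. non-blank; yield the non-blank runs.
    for nonblank, lines in groupby(file_handle, key=lambda line: bool(line.strip())):
        if nonblank:
            yield lines  # a sub-iterator over one section's lines

As with any groupby, each yielded sub-iterator must be consumed (or copied with list) before the outer loop advances, which is exactly the discipline described above.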
Or use a dark, secret hack method that I'll show in a mo' (I need to write it), but don't do it!
As promised, the hack (for Python 3, this time 'round):
from collections import UserList

def objectitemcaller(key):
    def inner(*args, **kwargs):
        try:
            return getattr(object, key)(*args, **kwargs)
        except AttributeError:
            return NotImplemented
    return inner

class Listable(UserList):
    def __init__(self, iterator):
        self.iterator = iterator
        self.iterated = False
    def __iter__(self):
        return self
    def __next__(self):
        self.iterated = True
        return next(self.iterator)
    def _to_list_hack(self):
        self.data = list(self)
        del self.iterated
        del self.iterator
        self.__class__ = UserList

for key in UserList.__dict__.keys() - Listable.__dict__.keys():
    if key not in ["__class__", "__dict__", "__module__", "__subclasshook__"]:
        setattr(Listable, key, objectitemcaller(key))
def metagen():
    def innergen():
        yield 1
        yield 2
        yield 3
    for i in range(3):
        r = Listable(innergen())
        yield r
        if not r.iterated:
            r._to_list_hack()
        else:
            for item in r: pass

for item in metagen():
    print(item)
    print(list(item))
#>>> <Listable object at 0x7f46e4a4b850>
#>>> [1, 2, 3]
#>>> <Listable object at 0x7f46e4a4b950>
#>>> [1, 2, 3]
#>>> <Listable object at 0x7f46e4a4b990>
#>>> [1, 2, 3]
list(metagen())
#>>> [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
It's so bad I don't want to even explain it.
The key is that you have a wrapper that can detect whether it has been iterated, and if not you run a _to_list_hack that, I kid you not, changes the __class__ attribute.
Because of conflicting layouts we have to use the UserList class and shadow all of its methods, which is just another layer of crud.
Basically, please don't use this hack. You can enjoy it as humour, though.
A rather pragmatic way would be to tell the "generator of generators" upon creation whether to generate generators or lists. While this is not as convenient as having list magically know what to do, it still seems to be more comfortable than having a special to_list function.
def gengen(n, listmode=False):
    for i in range(n):
        def gen():
            for k in range(i + 1):
                yield k
        yield list(gen()) if listmode else gen()
Depending on the listmode parameter, this can either be used to generate generators or lists.
for gg in gengen(5, False):
    print gg, list(gg)

print list(gengen(5, True))

How to len(generator()) [duplicate]

This question already has answers here:
Length of generator output [duplicate]
(9 answers)
What's the shortest way to count the number of items in a generator/iterator?
(7 answers)
Python generators are very useful. They have advantages over functions that return lists. However, you could len(list_returning_function()). Is there a way to len(generator_function())?
UPDATE:
Of course len(list(generator_function())) would work...
I'm trying to use a generator I've created inside a new generator I'm creating. As part of the calculation in the new generator, it needs to know the length of the old one. However, I would like both of them to keep the properties of a generator; specifically, not to hold the entire list in memory, as it may be very long.
UPDATE 2:
Assume the generator knows its target length even from the first step. Also, there's no reason to maintain the len() syntax. Example: if functions in Python are objects, couldn't I assign the length to a variable of this object that would be accessible to the new generator?
The conversion to list that's been suggested in the other answers is the best way if you still want to process the generator elements afterwards, but it has one flaw: it uses O(n) memory. You can count the elements in a generator without using that much memory with:
sum(1 for x in generator)
Of course, be aware that this might be slower than len(list(generator)) in common Python implementations, and if the generators are long enough for the memory complexity to matter, the operation would take quite some time. Still, I personally prefer this solution as it describes what I want to get, and it doesn't give me anything extra that's not required (such as a list of all the elements).
Also listen to delnan's advice: if you're discarding the output of the generator, it is very likely that there is a way to calculate the number of elements without running it, or to count them in another manner.
Generators have no length; they aren't collections, after all.
Generators are functions with an internal state (and fancy syntax). You can repeatedly call them to get a sequence of values, so you can use them in a loop. But they don't contain any elements, so asking for the length of a generator is like asking for the length of a function.
if functions in Python are objects, couldn't I assign the length to a variable of this object that would be accessible to the new generator?
Functions are objects (and you can even attach attributes to them), but the generator objects they return do not accept new attributes. The reason is probably to keep such a basic object as efficient as possible.
You can however simply return (generator, length) pairs from your functions or wrap the generator in a simple object like this:
class GeneratorLen(object):
    def __init__(self, gen, length):
        self.gen = gen
        self.length = length
    def __len__(self):
        return self.length
    def __iter__(self):
        return self.gen

g = some_generator()
h = GeneratorLen(g, 1)
print len(h), list(h)
Suppose we have a generator:
def gen():
    for i in range(10):
        yield i
We can wrap the generator, along with the known length, in an object:
import itertools

class LenGen(object):
    def __init__(self, gen, length):
        self.gen = gen
        self.length = length
    def __call__(self):
        return itertools.islice(self.gen(), self.length)
    def __len__(self):
        return self.length

lgen = LenGen(gen, 10)
Instances of LenGen behave like generator functions: calling one returns an iterator.
Now we can use the lgen generator in place of gen, and access len(lgen) as well:
def new_gen():
    for i in lgen():
        yield float(i) / len(lgen)

for i in new_gen():
    print(i)
You can use len(list(generator_function())). However, this consumes the generator; it's also the only way to find out how many elements it generates. So you may want to save the list somewhere if you also want to use the items.
a = list(generator_function())
print(len(a))
print(a[0])
You can use len(list(generator)), but you can probably do something more efficient if you really intend to discard the results (e.g. the sum(1 for _ in generator) approach shown above).
You can use reduce.
For Python 3:
>>> import functools
>>> def gen():
...     yield 1
...     yield 2
...     yield 3
...
>>> functools.reduce(lambda x, y: x + 1, gen(), 0)
3
In Python 2, reduce is in the global namespace so the import is unnecessary.
You can use send as a hack:
def counter():
    length = 10
    i = 0
    while i < length:
        val = (yield i)
        if val == 'length':
            yield length
        i += 1
it = counter()
print(next(it))
# 0
print(next(it))
# 1
print(it.send('length'))
# 10
print(next(it))
# 2
print(next(it))
# 3
You can combine the benefits of generators with the certainty of len(), by creating your own iterable object:
class MyIterable(object):
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __iter__(self):
        self._gen = self._generator()
        return self
    def _generator(self):
        # Put your generator code here
        i = 0
        while i < self.n:
            yield i
            i += 1
    def next(self):
        return next(self._gen)

mi = MyIterable(100)
print len(mi)
for i in mi:
    print i,
This is basically a simple implementation of xrange, which returns an object you can take the len of, but doesn't create an explicit list.

What is a good way to decorate an iterator to alter the value before next is called in python?

I am working on a problem that involves validating a format from within a unified diff patch.
The variables within the inner format can span multiple lines at a time, so I wrote a generator that pulls each line and yields the variable when it is complete.
To avoid having to rewrite this function when reading from a unified diff file, I created a generator to strip the unified diff characters from the line before passing it to the inner format validator. However, I am getting stuck in an infinite loop (both in the code and in my head). I have abstracted the problem to the following code. I'm sure there is a better way to do this. I just don't know what it is.
from collections import Iterable

def inner_format_validator(inner_item):
    # Do some validation to inner items
    return inner_item[0] != '+'

def inner_gen(iterable):
    for inner_item in iterable:
        # Operates only on inner_info type data
        yield inner_format_validator(inner_item)

def outer_gen(iterable):
    class DecoratedGenerator(Iterable):
        def __iter__(self):
            return self
        def next(self):
            # Using iterable from closure
            for outer_item in iterable:
                self.outer_info = outer_item[0]
                inner_item = outer_item[1:]
                return inner_item
    decorated_gen = DecoratedGenerator()
    for inner_item in inner_gen(decorated_gen):
        yield inner_item, decorated_gen.outer_info
if __name__ == '__main__':
    def wrap(string):
        # The point here is that I don't know what the first character will be
        pseudo_rand = len(string)
        if pseudo_rand * pseudo_rand % 2 == 0:
            return '+' + string
        else:
            return '-' + string

    inner_items = ["whatever"] * 3
    # wrap screws up inner_format_validator
    outer_items = [wrap("whatever")] * 3

    # I need to be able to
    # iterate over inner_items
    for inner_info in inner_gen(inner_items):
        print(inner_info)

    # and iterate over outer_items
    for outer_info, inner_info in outer_gen(outer_items):
        # This is an infinite loop
        print(outer_info)
        print(inner_info)
Any ideas as to a better, more pythonic way to do this?
I would do something simpler, like this:
def outer_gen(iterable):
    iterable = iter(iterable)
    first_item = next(iterable)
    info = first_item[0]
    yield info, first_item[1:]
    for item in iterable:
        yield info, item
This will execute the first four lines only once, then enter the loop and yield what you want.
You probably want to add some try/except to catch IndexErrors here and there.
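For example, a guarded version for empty input might look like this (a sketch; note that under PEP 479, Python 3.7+, letting the StopIteration from next() escape a generator raises RuntimeError):

def outer_gen(iterable):
    iterable = iter(iterable)
    try:
        first_item = next(iterable)
    except StopIteration:
        return  # empty input: yield nothing
    info = first_item[0]  # an empty string here would still raise IndexError
    yield info, first_item[1:]
    for item in iterable:
        yield info, item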
If you want to take values while they start with something, or drop them while they do, remember you can use a lot of stuff from the itertools toolbox, in particular dropwhile, takewhile and chain:
>>> import itertools
>>> l = ['+foo', '-bar', '+foo']
>>> list(itertools.takewhile(lambda x: x.startswith('+'), l))
['+foo']
>>> list(itertools.dropwhile(lambda x: x.startswith('+'), l))
['-bar', '+foo']
>>> a = itertools.takewhile(lambda x: x.startswith('+'), l)
>>> b = itertools.dropwhile(lambda x: x.startswith('+'), l)
>>> list(itertools.chain(a, b))
['+foo', '-bar', '+foo']
And remember that you can create generators like list comprehensions, store them in variables and chain them, just like you would pipe linux commands:
import random

def create_item():
    return random.choice(('+', '-')) + random.choice(('foo', 'bar'))

random_items = (create_item() for s in xrange(10))
added_items = ((i[0], i[1:]) for i in random_items if i.startswith('+'))
valid_items = ((prefix, line) for prefix, line in added_items if 'foo' in line)
print list(valid_items)
With all this, you should be able to find some pythonic way to solve your problem :-)
I still don't like this very much, but at least it's shorter and a tad more pythonic:
from itertools import imap, izip
from functools import partial

def inner_format_validator(inner_item):
    return not inner_item.startswith('+')

inner_gen = partial(imap, inner_format_validator)

def split(astr):
    return astr[0], astr[1:]

def outer_gen(iterable):
    outer_stuff, inner_stuff = izip(*imap(split, iterable))
    return izip(inner_gen(inner_stuff), outer_stuff)
[EDIT] inner_gen() and outer_gen() without imap and partial:
def inner_gen(iterable):
    for each in iterable:
        yield inner_format_validator(each)

def outer_gen(iterable):
    outer_stuff, inner_stuff = izip(*(split(each) for each in iterable))
    return izip(inner_gen(inner_stuff), outer_stuff)
Maybe this is a better, though different, solution:
def transmogrify(iter_of_iters, *transmogrifiers):
    for iters in iter_of_iters:
        yield (
            trans(each) if trans else each
            for trans, each in izip(transmogrifiers, iters)
        )

for outer, inner in transmogrify(imap(split, stuff), inner_format_validator, None):
    print inner, outer
I think it will do what you intended if you change the definition of DecoratedGenerator to this:
class DecoratedGenerator(Iterable):
    def __iter__(self):
        # Using iterable from closure
        for outer_item in iterable:
            self.outer_info = outer_item[0]
            inner_item = outer_item[1:]
            yield inner_item
Your original version never terminated because its next() method was stateless and would return the same value every time it was called. You didn't need to have a next() method at all, though--you can implement __iter__() yourself (as I did), and then it all works fine.
