Consider this:
>>> res = [list(g) for k,g in itertools.groupby('abbba')]
>>> res
[['a'], ['b', 'b', 'b'], ['a']]
and then this:
>>> res = [g for k,g in itertools.groupby('abbba')]
>>> list(res[0])
[]
I'm baffled by this. Why do they return different results?
This is expected behavior. The documentation is pretty clear that the iterator for the grouper is shared with the groupby iterator:
The returned group is itself an iterator that shares the underlying
iterable with groupby(). Because the source is shared, when the
groupby() object is advanced, the previous group is no longer visible.
So, if that data is needed later, it should be stored as a list...
The reason you are getting empty lists is that the iterator is already consumed by the time you try to iterate over it.
import itertools
res = [g for k,g in itertools.groupby('abbba')]
next(res[0])
# Raises StopIteration:
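To make the contrast concrete, here is a minimal sketch using only the standard library. It compares materializing each group while groupby() is still positioned on it with keeping the bare group iterators and reading them afterwards. The variable names are illustrative, and the printed values assume a current CPython 3 interpreter.

```python
import itertools

# Materialize each group before groupby() advances to the next one.
stored = [list(g) for _, g in itertools.groupby('abbba')]

# Keep only the group iterators; groupby() has already moved past
# them by the time they are read.
deferred = [g for _, g in itertools.groupby('abbba')]
first_group = list(deferred[0])

print(stored)       # [['a'], ['b', 'b', 'b'], ['a']]
print(first_group)  # []
```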
If we run the following code,
from itertools import groupby
s = '1223'
r = groupby(s)
x = list(r)
a = [list(g) for k, g in r]
print(a)
b = [list(g) for k, g in groupby(s)]
print(b)
then surprisingly the two output lines are DIFFERENT:
[]
[['1'], ['2', '2'], ['3']]
But if we remove the line "x=list(r)", then the two lines are the same, as expected. I don't understand why the list() function will change the groupby result.
The result of groupby, as with many objects in the itertools library, is an iterator that can only be iterated over once. This is to allow lazy evaluation. Therefore, once you call something like list(r), r is exhausted.
When you iterate over the exhausted iterator, of course the resulting list is empty. In your second case, you don't consume the iterator before you iterate over it, so it still yields every group.
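A short sketch of that one-pass behavior, reusing the string from the question. The names first_pass, second_pass and fresh are made up for illustration.

```python
from itertools import groupby

s = '1223'
r = groupby(s)

first_pass = [k for k, _ in r]          # consumes r completely
second_pass = [list(g) for _, g in r]   # r is exhausted, nothing is left

# A fresh groupby() call starts over from the beginning of s.
fresh = [list(g) for _, g in groupby(s)]

print(first_pass)   # ['1', '2', '3']
print(second_pass)  # []
print(fresh)        # [['1'], ['2', '2'], ['3']]
```

If the groups are needed more than once, store them as lists on the first pass or call groupby() again on the original data.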
I have a bit of code that runs many thousands of times in my project:
def resample(freq, data):
    output = []
    for i, elem in enumerate(freq):
        for _ in range(elem):
            output.append(data[i])
    return output
eg. resample([1,2,3], ['a', 'b', 'c']) => ['a', 'b', 'b', 'c', 'c', 'c']
I want to speed this up as much as possible. It seems like a list comprehension could be faster. I have tried:
def resample(freq, data):
    return [item for sublist in [[data[i]]*elem for i, elem in enumerate(freq)] for item in sublist]
Which is hideous and also slow because it builds the list and then flattens it. Is there a way to do this with one line list comprehension that is fast? Or maybe something with numpy?
Thanks in advance!
edit: Answer does not necessarily need to eliminate the nested loops, fastest code is the best
I highly suggest using generators like so:
from itertools import repeat, chain
def resample(freq, data):
    return chain.from_iterable(map(repeat, data, freq))
This is likely to be among the fastest approaches: in CPython, map(), repeat() and chain.from_iterable() are all implemented in C, so the per-item work happens outside the interpreter's bytecode loop.
As for a small explanation:
repeat(i, n) returns an iterator that repeats an item i, n times.
map(repeat, data, freq) returns an iterator that calls repeat every time on an element of data and an element of freq. Basically an iterator that returns repeat() iterators.
chain.from_iterable() flattens the iterator of iterators to return the end items.
No intermediate list is created along the way, so memory overhead stays small. As an added benefit, data can hold items of any type, not just one-character strings.
While I don't suggest it, you are able to convert it into a list() like so:
result = list(resample([1,2,3], ['a','b','c']))
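The three steps described above can be looked at one at a time. This is only an illustrative sketch: the step_* names are made up, and each stage is wrapped in list() purely so it can be printed.

```python
from itertools import chain, repeat

freq = [1, 2, 3]
data = ['a', 'b', 'c']

# Step 1: repeat(item, n) yields item n times.
step_repeat = list(repeat('b', 2))

# Step 2: map(repeat, data, freq) pairs each item with its count
# and produces one repeat() iterator per pair.
step_map = [list(r) for r in map(repeat, data, freq)]

# Step 3: chain.from_iterable() flattens those iterators.
step_chain = list(chain.from_iterable(map(repeat, data, freq)))

print(step_repeat)  # ['b', 'b']
print(step_map)     # [['a'], ['b', 'b'], ['c', 'c', 'c']]
print(step_chain)   # ['a', 'b', 'b', 'c', 'c', 'c']
```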
import itertools
def resample(freq, data):
    return itertools.chain.from_iterable([el]*n for el, n in zip(data, freq))
Besides being faster, this also has the advantage of being lazy: it returns an iterator, and the elements are produced one at a time as they are requested.
No need to create lists at all, just use a nested loop:
[e for i, e in enumerate(data) for j in range(freq[i])]
# ['a', 'b', 'b', 'c', 'c', 'c']
You can just as easily make this lazy by removing the brackets:
(e for i, e in enumerate(data) for j in range(freq[i]))
I was trying to use itertools.groupby to help me group a list of integers by positive or negative property, for example:
input
[1,2,3, -1,-2,-3, 1,2,3, -1,-2,-3]
will return
[[1,2,3],[-1,-2,-3],[1,2,3],[-1,-2,-3]]
However if I:
import itertools
nums = [1,2,3, -1,-2,-3, 1,2,3, -1,-2,-3]
group_list = list(itertools.groupby(nums, key=lambda x: x>=0))
print(group_list)
for k, v in group_list:
    print(list(v))
>>>
[]
[-3]
[]
[]
But if I don't list() the groupby object, it will work fine:
nums = [1,2,3, -1,-2,-3, 1,2,3, -1,-2,-3]
group_list = itertools.groupby(nums, key=lambda x: x>=0)
for k, v in group_list:
    print(list(v))
>>>
[1, 2, 3]
[-1, -2, -3]
[1, 2, 3]
[-1, -2, -3]
What I don't understand is this: a groupby object is an iterator that yields pairs of a key and a _grouper object, so shouldn't calling list() on the groupby object leave the _grouper objects unconsumed?
And even if it did consume, how did I get [-3] from the second element?
Per the docs, it is explicitly noted that advancing the groupby object renders the previous group unusable (in practice, empty):
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.
Basically, instead of list-ifying directly with the list constructor, you'd need a listcomp that converts from group iterators to lists before advancing the groupby object, replacing:
group_list = list(itertools.groupby(nums, key=lambda x: x>=0))
with:
group_list = [(k, list(g)) for k, g in itertools.groupby(nums, key=lambda x: x>=0)]
The design of most itertools module types is intended to avoid storing data implicitly, because they're intended to be used with potentially huge inputs. If all the groupers stored copies of all the data from the input (and the groupby object had to be sure to retroactively populate them), it would get ugly, and potentially blow memory by accident. By forcing you to store the values explicitly, the design keeps you from accidentally holding unbounded amounts of data, per the Zen of Python:
Explicit is better than implicit.
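Applied to the list from the question, the suggested replacement behaves as follows. This is a small sketch of that list comprehension only; nothing in it goes beyond the standard library.

```python
import itertools

nums = [1, 2, 3, -1, -2, -3, 1, 2, 3, -1, -2, -3]

# Convert each group to a list while it is still the current group.
group_list = [(k, list(g))
              for k, g in itertools.groupby(nums, key=lambda x: x >= 0)]

for k, v in group_list:
    print(k, v)
# True [1, 2, 3]
# False [-1, -2, -3]
# True [1, 2, 3]
# False [-1, -2, -3]
```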
I am trying to learn how to use itertools.groupby in Python and I wanted to find the size of each group of characters. At first I tried to see if I could find the length of a single group:
from itertools import groupby
len(list(list( groupby("cccccaaaaatttttsssssss") )[0][1]))
and I would get 0 every time.
I did a little research and found out that other people were doing it this way:
from itertools import groupby
for key, grouper in groupby("cccccaaaaatttttsssssss"):
    print(key, len(list(grouper)))
Which works great. What I am confused about is why does the latter code work, but the former does not? If I wanted to get only the nth group like I was trying to do in my original code, how would I do that?
The reason that your first approach doesn't work is that the groups get "consumed" when you create that list with
list(groupby("cccccaaaaatttttsssssss"))
To quote from the groupby docs
The returned group is itself an iterator that shares the underlying
iterable with groupby(). Because the source is shared, when the
groupby() object is advanced, the previous group is no longer
visible.
Let's break it down into stages.
from itertools import groupby
a = list(groupby("cccccaaaaatttttsssssss"))
print(a)
b = a[0][1]
print(b)
print('So far, so good')
print(list(b))
print('What?!')
output
[('c', <itertools._grouper object at 0xb715104c>), ('a', <itertools._grouper object at 0xb715108c>), ('t', <itertools._grouper object at 0xb71510cc>), ('s', <itertools._grouper object at 0xb715110c>)]
<itertools._grouper object at 0xb715104c>
So far, so good
[]
What?!
Our itertools._grouper object at 0xb715104c is empty because it shares its contents with the "parent" iterator returned by groupby, and those items are now gone because that first list call iterated over the parent.
It's really no different to what happens if you try to iterate twice over any iterator, eg a simple generator expression.
g = (c for c in 'python')
print(list(g))
print(list(g))
output
['p', 'y', 't', 'h', 'o', 'n']
[]
BTW, here's another way to get the length of a groupby group if you don't actually need its contents; it's a little cheaper (and uses less RAM) than building a list just to find its length.
from itertools import groupby
for k, g in groupby("cccccaaaaatttttsssssss"):
    print(k, sum(1 for _ in g))
output
c 5
a 5
t 5
s 7
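As for fetching only the nth group, one possible approach is to skip ahead with itertools.islice and turn the group into a list straight away. The helper name nth_group below is hypothetical (it is not part of itertools), and the sketch assumes a 0-based index.

```python
from itertools import groupby, islice

def nth_group(iterable, n):
    """Return (key, items) for the n-th group (0-based), or None if absent.

    nth_group is a hypothetical helper name, not an itertools function.
    """
    groups = groupby(iterable)
    # islice skips the first n groups; the selected group is converted
    # to a list immediately, before groupby() is advanced again.
    for key, group in islice(groups, n, n + 1):
        return key, list(group)
    return None

key, items = nth_group("cccccaaaaatttttsssssss", 3)
print(key, len(items))  # s 7
```

The earlier groups are still read from the input (groupby has no way to jump ahead), but their items are never stored.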
I would have expected these two pieces of code to produce the same results
from itertools import groupby
for i in list(groupby('aaaabb')):
    print(i[0], list(i[1]))
for i, j in groupby('aaaabb'):
    print(i, list(j))
In one I convert the iterator returned by groupby to a list and iterate over that, and in the other I iterate over the returned iterator directly.
The output of this script is
a []
b ['b']
a ['a', 'a', 'a', 'a']
b ['b', 'b']
Why is this the case?
Edit: for reference, the result of groupby('aaaabb') looks like
('a', <itertools._grouper object at 0x10c1324d0>)
('b', <itertools._grouper object at 0x10c132250>)
This is a quirk of the groupby function, presumably for performance.
From the itertools.groupby documentation:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list:
groups = []
uniquekeys = []
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    groups.append(list(g))    # Store group iterator as a list
    uniquekeys.append(k)
So, you can do this:
for i in [(x, list(y)) for x, y in groupby('aabbaa')]:
    print(i[0], i[1])