itertools groupby object not outputting correctly - python

I was trying to use itertools.groupby to help me group a list of integers by sign (non-negative vs. negative). For example, the input
[1,2,3, -1,-2,-3, 1,2,3, -1,-2,-3]
should produce
[[1,2,3],[-1,-2,-3],[1,2,3],[-1,-2,-3]]
However if I:
import itertools
nums = [1,2,3, -1,-2,-3, 1,2,3, -1,-2,-3]
group_list = list(itertools.groupby(nums, key=lambda x: x>=0))
print(group_list)
for k, v in group_list:
    print(list(v))
>>>
[]
[-3]
[]
[]
But if I don't list() the groupby object, it will work fine:
nums = [1,2,3, -1,-2,-3, 1,2,3, -1,-2,-3]
group_list = itertools.groupby(nums, key=lambda x: x>=0)
for k, v in group_list:
    print(list(v))
>>>
[1, 2, 3]
[-1, -2, -3]
[1, 2, 3]
[-1, -2, -3]
What I don't understand is this: a groupby object is an iterator that yields pairs of a key and a _grouper object, so shouldn't calling list() on the groupby object leave the _grouper objects unconsumed?
And even if it did consume them, how did I end up with [-3] as the second element?

Per the docs, it is explicitly noted that advancing the groupby object renders the previous group unusable (in practice, empty):
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.
Basically, instead of list-ifying directly with the list constructor, you'd need a listcomp that converts from group iterators to lists before advancing the groupby object, replacing:
group_list = list(itertools.groupby(nums, key=lambda x: x>=0))
with:
group_list = [(k, list(g)) for k, g in itertools.groupby(nums, key=lambda x: x>=0)]
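With that change each group is materialized before the groupby object advances, so for the example input you get:
>>> nums = [1, 2, 3, -1, -2, -3, 1, 2, 3, -1, -2, -3]
>>> [(k, list(g)) for k, g in itertools.groupby(nums, key=lambda x: x >= 0)]
[(True, [1, 2, 3]), (False, [-1, -2, -3]), (True, [1, 2, 3]), (False, [-1, -2, -3])]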
Most itertools types are designed to avoid storing data implicitly, because they're meant to work with potentially huge inputs. If every grouper stored a copy of its slice of the input (and the groupby object had to retroactively populate them), it would get ugly and could blow memory by accident. By forcing you to store the values explicitly, the API keeps you from holding on to unbounded amounts of data unintentionally, per the Zen of Python:
Explicit is better than implicit.

Related

Unable to retrieve itertools iterators outside list comprehensions

Consider this:
>>> res = [list(g) for k,g in itertools.groupby('abbba')]
>>> res
[['a'], ['b', 'b', 'b'], ['a']]
and then this:
>>> res = [g for k,g in itertools.groupby('abbba')]
>>> list(res[0])
[]
I'm baffled by this. Why do they return different results?
This is expected behavior. The documentation is pretty clear that the iterator for the grouper is shared with the groupby iterator:
The returned group is itself an iterator that shares the underlying
iterable with groupby(). Because the source is shared, when the
groupby() object is advanced, the previous group is no longer visible.
So, if that data is needed later, it should be stored as a list...
The reason you are getting empty lists is that the iterator has already been consumed by the time you try to iterate over it.
import itertools
res = [g for k,g in itertools.groupby('abbba')]
next(res[0])
# Raises StopIteration:

Lambda function with itertools count() and groupby()

Can someone please explain the groupby operation and the lambda function being used on this SO post?
key=lambda k, line=count(): next(line) // chunk
import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        # The itertools.groupby() function takes a sequence and a key function,
        # and returns an iterator that generates pairs.
        # Each pair contains the result of key_function(each item) and
        # another iterator containing all the items that shared that key result.
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            # (printing list(group) here would exhaust the group, so the
            # write loop below would see nothing)
            output_name = os.path.normpath(
                os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)
Edit: It took me a while to wrap my head around the lambda function used with groupby. I don't think I understood either of them very well.
Martijn explained it really well; however, I have a follow-up question. Why is line=count() passed as an argument to the lambda function every time? I tried assigning the variable line to count() just once, outside the function:
line = count()
groups = groupby(datafile, key=lambda k, line: next(line) // chunk)
and it resulted in TypeError: <lambda>() missing 1 required positional argument: 'line'
Also, calling next on count() directly within the lambda expression resulted in all the lines in the input file getting bunched together, i.e. a single key was generated by the groupby function.
groups = groupby(datafile, key=lambda k: next(count()) // chunk)
I'm learning Python on my own, so any help or pointers to reference materials /PyCon talks are much appreciated. Anything really!
itertools.count() is an infinite iterator of increasing integer numbers.
The lambda stores a count() instance as the default value of its line argument, so every time the lambda is called the local variable line refers to that same object. next() advances an iterator, retrieving the next value:
>>> from itertools import count
>>> line = count()
>>> next(line)
0
>>> next(line)
1
>>> next(line)
2
>>> next(line)
3
So next(line) retrieves the next count in the sequence, and divides that value by chunk (taking only the integer portion of the division). The k argument is ignored.
Because integer division is used, the result of the lambda is going to be chunk repeats of an increasing integer; if chunk is 3, then you get 0 three times, then 1 three times, then 2 three times, etc:
>>> chunk = 3
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> chunk = 4
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
It is this resulting value that groupby() groups the datafile iterable by, producing groups of chunk lines.
When looping over the groupby() results with for k, group in groups:, k is the number that the lambda produced and the results are grouped by; the code uses it to build the temporary file name. group is an iterable of the lines from datafile and will contain chunk lines (the final group may hold fewer if the line count is not an exact multiple of chunk).
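A minimal sketch of the same idea (using a small made-up list of strings in place of the file object, and chunk = 3) shows the grouping:
>>> from itertools import groupby, count
>>> chunk = 3
>>> lines = ['line%d\n' % i for i in range(8)]
>>> for k, group in groupby(lines, key=lambda _, c=count(): next(c) // chunk):
...     print(k, list(group))
0 ['line0\n', 'line1\n', 'line2\n']
1 ['line3\n', 'line4\n', 'line5\n']
2 ['line6\n', 'line7\n']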
In response to the updated OP...
The itertools.groupby iterator offers ways to group items together, giving more control when a key function is defined. See more on how itertools.groupby() works.
A lambda function is just a shorthand way of writing a regular function. For example:
>>> keyfunc = lambda k, line=count(): next(line) // chunk
is equivalent to this regular function:
>>> def keyfunc(k, line=count()):
...     return next(line) // chunk
Keywords: iterator, functional programming, anonymous functions
Details
Why is line=count() passed as an argument to the lambda function every time?
The reason is the same as for normal functions. By itself, line would be a required positional parameter. Giving it a default value (line=count()) makes it optional, and the default is evaluated only once, when the lambda is defined, so every call shares the same count() object. See more on positional vs. keyword arguments.
You can still create the count() outside the lambda, as long as you pass it in as the default value of a keyword argument:
>>> chunk = 3
>>> line=count()
>>> keyfunc = lambda k, line=line: next(line) // chunk # make `line` a keyword arg
>>> [keyfunc("") for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> [keyfunc("") for _ in range(10)]
[3, 3, 4, 4, 4, 5, 5, 5, 6, 6] # note `count()` continues
... calling next on count() directly within the lambda expression, resulted in all the lines in the input file getting bunched together i.e a single key was generated by the groupby function ...
Try the following experiment with count():
>>> numbers = count()
>>> next(numbers)
0
>>> next(numbers)
1
>>> next(numbers)
2
As expected, next() yields the next item from the count() iterator (a for loop does essentially the same thing when it iterates over an iterator). The important point is that the iterator does not reset: next() simply gives the next item in line, as seen in the example above.
@Martijn Pieters pointed out that next(line) // chunk computes a floored integer that groupby uses to identify each line, bunching lines with the same id together. By contrast, lambda k: next(count()) // chunk creates a brand-new count() on every call, so next() always returns 0 and every line gets the same key, which is why everything ended up in a single group. See the references for more on how groupby works.
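You can see this with a quick check; a fresh count() always starts over at zero:
>>> from itertools import count
>>> next(count()), next(count()), next(count())
(0, 0, 0)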
References
Docs for itertools.count()
Docs for itertools.groupby()
Beazley, D. and Jones, B. "7.7 Capturing Variables in Anonymous Functions," Python Cookbook, 3rd ed. O'Reilly. 2013.

Multi Dimensional List - Sum Integer Element X by Common String Element Y

I have a multi dimensional list:
multiDimList = [['a',1],['a',1],['a',1],['b',2],['c',3],['c',3]]
I'm trying to sum the instances of element [1] where element [0] is common.
To put it more clearly, my desired output is another multi dimensional list:
multiDimListSum = [['a',3],['b',2],['c',6]]
I see I can access, say the value '2' in multiDimList by
x = multiDimList [3][1]
so I can grab the individual elements, and could probably build some sort of function to do this job, but it'd be disgusting.
Does anyone have a suggestion of how to do this pythonically?
Assuming your actual sequence has similar elements grouped together as in your example (all instances of 'a', 'b' etc. together), you can use itertools.groupby() and operator.itemgetter():
from itertools import groupby
from operator import itemgetter
[[k, sum(v[1] for v in g)] for k, g in groupby(multiDimList, itemgetter(0))]
# result: [['a', 3], ['b', 2], ['c', 6]]
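If the input isn't already grouped by its first element, one option (not shown in the original answer) is to sort it with the same key first:
from itertools import groupby
from operator import itemgetter

key = itemgetter(0)
[[k, sum(v[1] for v in g)] for k, g in groupby(sorted(multiDimList, key=key), key)]
# result: [['a', 3], ['b', 2], ['c', 6]]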
Zero Piraeus's answer covers the case when field entries are grouped in order. If they're not, then the following is short and reasonably efficient.
from collections import Counter
# reduce is a builtin on Python 2; on Python 3, add: from functools import reduce
reduce(lambda c, x: c.update({x[0]: x[1]}) or c, multiDimList, Counter())
This returns a Counter, indexed by the string element. If you prefer it as a list you can call the .items() method on it, but note that the order of the labels in the output may differ from the order in the input, even when the input was consistently ordered.
You could use a dict to accumulate the total associated with each string:
d = {}
multiDimList = [['a',1],['a',1],['a',1],['b',2],['c',3],['c',3]]
for string, value in multiDimList:
    # Retrieve the running total for this key if it exists, or 0
    current_value = d.get(string, 0)
    d[string] = current_value + value
print d # {'a': 3, 'b': 2, 'c': 6}
You can then access the value for b by using d["b"].
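A collections.defaultdict makes the same accumulation slightly shorter (a sketch, not part of the original answers):
from collections import defaultdict

totals = defaultdict(int)
for string, value in multiDimList:
    totals[string] += value
# dict(totals) -> {'a': 3, 'b': 2, 'c': 6}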

Iterate over a pair of iterables, sorted by an attribute

One way (the fastest way?) to iterate over a pair of iterables a and b in sorted order is to chain them and sort the chained iterable:
for i in sorted(chain(a, b)):
    print i
For instance, if the elements of each iterable are:
a: 4, 6, 1
b: 8, 3
then this construct would produce elements in the order
1, 3, 4, 6, 8
However, if the iterables iterate over objects, this sorts the objects by their memory address. Assuming each iterable iterates over the same type of object,
What is the fastest way to iterate over a particular attribute of the objects, sorted by this attribute?
What if the attribute to be chosen differs between iterables? If iterables a and b both iterate over objects of type foo, which has attributes foo.x and foo.y of the same type, how could one iterate over elements of a sorted by x and b sorted by y?
For an example of #2, if
a: (x=4,y=3), (x=6,y=2), (x=1,y=7)
b: (x=2,y=8), (x=2,y=3)
then the elements should be produced in the order
1, 3, 4, 6, 8
as before. Note that only the x attributes from a and the y attributes from b enter into the sort and the result.
Tim Pietzcker has already answered for the case where you're using the same attribute for each iterable. If you're using different attributes of the same type, you can do it like this (using complex numbers as a ready-made class that has two attributes of the same type):
In Python 2:
>>> a = [1+4j, 7+0j, 3+6j, 9+2j, 5+8j]
>>> b = [2+5j, 8+1j, 4+7j, 0+3j, 6+9j]
>>> keyed_a = ((n.real, n) for n in a)
>>> keyed_b = ((n.imag, n) for n in b)
>>> from itertools import chain
>>> sorted_ab = zip(*sorted(chain(keyed_a, keyed_b), key=lambda t: t[0]))[1]
>>> sorted_ab
((1+4j), (8+1j), (3+6j), 3j, (5+8j), (2+5j), (7+0j), (4+7j), (9+2j), (6+9j))
Since in Python 3 zip() returns an iterator, we need to coerce it to a list before attempting to subscript it:
>>> # ... as before up to 'from itertools import chain'
>>> sorted_ab = list(zip(*sorted(chain(keyed_a, keyed_b), key=lambda t: t[0])))[1]
>>> sorted_ab
((1+4j), (8+1j), (3+6j), 3j, (5+8j), (2+5j), (7+0j), (4+7j), (9+2j), (6+9j))
Answer to question 1: you can provide a key argument to sorted(). For example, if you want to sort by the objects' .name attribute, then use
sorted(chain(a, b), key=lambda x: x.name)
As for question 2: I guess you'd need another attribute for each object (like foo.z, as suggested by Zero Piraeus) that can be accessed by sorted(), since that function has no way of telling where the object it's currently sorting used to come from. After all, it is receiving a new iterator from chain() which doesn't contain any information about whether the current element is from a or b.
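The decorate-and-sort idea from the earlier answer generalizes to arbitrary objects; here is a sketch using a hypothetical Foo class with x and y attributes (the values match the example in the question):
from itertools import chain

class Foo(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

a = [Foo(4, 3), Foo(6, 2), Foo(1, 7)]
b = [Foo(2, 8), Foo(2, 3)]

# Pair each object with the attribute it should be sorted by,
# then sort on that key only.
keyed = chain(((o.x, o) for o in a), ((o.y, o) for o in b))
for key, obj in sorted(keyed, key=lambda t: t[0]):
    print(key)
# prints 1, 3, 4, 6, 8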

An interesting code for obtaining unique values from a list

Say given a list s = [2,2,2,3,3,3,4,4,4]
I saw the following code being used to obtain unique values from s:
unique_s = sorted(unique(s))
where unique is defined as:
def unique(seq):
    # not order preserving
    set = {}
    map(set.__setitem__, seq, [])
    return set.keys()
I'm just curious to know if there is any difference between this and just doing list(set(s))? Both result in a mutable object with the same values.
I'm guessing this code is faster since it's only looping once rather than twice in the case of type conversion?
You should use the code you describe:
list(set(s))
This works on all Pythons from 2.4 (I think) to 3.3, is concise, and uses built-ins in an easy to understand way.
The function unique appears to be designed to work if set is not a built-in, which is true for Python 2.3. Python 2.3 is fairly ancient (2003). The unique function is also broken on the Python 3.x series: there, map() is lazy (and stops at the shortest input), so the dict never actually gets populated, and dict.keys returns a view object rather than a list.
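For the sample list, the modern equivalent of sorted(unique(s)) is simply:
>>> s = [2, 2, 2, 3, 3, 3, 4, 4, 4]
>>> sorted(set(s))
[2, 3, 4]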
For a sorted sequence you could use the approach from the itertools unique_justseen() recipe to get unique values while preserving order:
from itertools import groupby
from operator import itemgetter
print map(itemgetter(0), groupby([2,2,2,3,3,3,4,4,4]))
# -> [2, 3, 4]
To remove duplicate items from a sorted sequence in place (leaving only the unique values):
def del_dups(sorted_seq):
    prev = object()
    pos = 0
    for item in sorted_seq:
        if item != prev:
            prev = item
            sorted_seq[pos] = item
            pos += 1
    del sorted_seq[pos:]
L = [2,2,2,3,3,3,4,4,4]
del_dups(L)
print L # -> [2, 3, 4]
