Can someone please explain the groupby operation and the lambda function being used on this SO post?
key=lambda k, line=count(): next(line) // chunk
import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        # The itertools.groupby() function takes a sequence and a key function,
        # and returns an iterator that generates pairs.
        # Each pair contains the result of key_function(each item) and
        # another iterator containing all the items that shared that key result.
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            print(k)  # note: printing list(group) here would exhaust the group
            output_name = os.path.normpath(
                os.path.join(temp_dir, "tempfile_%s.tmp" % k))
            with open(output_name, 'a') as outfile:
                for line in group:
                    outfile.write(line)
Edit: It took me a while to wrap my head around the lambda function used with groupby. I don't think I understood either of them very well.
Martijn explained it really well, however I have a follow up question. Why is line=count() passed as an argument to the lambda function every time? I tried assigning the variable line to count() just once, outside the function.
line = count()
groups = groupby(datafile, key=lambda k, line: next(line) // chunk)
and it resulted in TypeError: <lambda>() missing 1 required positional argument: 'line'
Also, calling next on count() directly within the lambda expression resulted in all the lines in the input file getting bunched together, i.e., a single key was generated by the groupby function.
groups = groupby(datafile, key=lambda k: next(count()) // chunk)
I'm learning Python on my own, so any help or pointers to reference materials /PyCon talks are much appreciated. Anything really!
itertools.count() is an infinite iterator of increasing integer numbers.
The lambda stores a count() instance as the default value of its line parameter, so every time the lambda is called the local variable line references that same object. next() advances an iterator, retrieving the next value:
>>> from itertools import count
>>> line = count()
>>> next(line)
0
>>> next(line)
1
>>> next(line)
2
>>> next(line)
3
So next(line) retrieves the next count in the sequence, and divides that value by chunk (taking only the integer portion of the division). The k argument is ignored.
Because integer division is used, the result of the lambda is going to be chunk repeats of an increasing integer; if chunk is 3, then you get 0 three times, then 1 three times, then 2 three times, etc:
>>> chunk = 3
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> chunk = 4
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
It is this resulting value that groupby() groups the datafile iterable by, producing groups of chunk lines.
When looping over the groupby() results with for k, group in groups:, k is the number that the lambda produced and the results are grouped by; the for loop in the code ignores this. group is an iterable of lines from datafile, and will contain chunk lines (the final group may hold fewer if the file's line count is not an exact multiple of chunk).
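As a minimal sketch of the same idea (a small list of strings stands in for the file object, and chunk is 2; both are assumptions for illustration):

from itertools import groupby, count

lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n']  # stand-in for datafile
chunk = 2
for k, group in groupby(lines, key=lambda _, c=count(): next(c) // chunk):
    print(k, list(group))
# 0 ['a\n', 'b\n']
# 1 ['c\n', 'd\n']
# 2 ['e\n']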
In response to the updated OP...
The itertools.groupby iterator offers ways to group items together, giving more control when a key function is defined. See more on how itertools.groupby() works.
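For example, a quick REPL sketch of groupby() grouping consecutive equal characters (a minimal illustration, not from the original post):

>>> from itertools import groupby
>>> [(k, list(g)) for k, g in groupby('aaabbc')]
[('a', ['a', 'a', 'a']), ('b', ['b', 'b']), ('c', ['c'])]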
A lambda function is shorthand for writing a regular function as a single expression. For example:
>>> keyfunc = lambda k, line=count(): next(line) // chunk
Is equivalent to this regular function:
>>> def keyfunc(k, line=count()):
...     return next(line) // chunk
Keywords: iterator, functional programming, anonymous functions
Details
Why is line=count() passed as an argument to the lambda function every time?
The reason is the same as for normal functions. On its own, line is a required positional parameter, so the two-argument lambda fails when groupby() calls it with a single item. When a default value is assigned (line=count()), it becomes an optional parameter, and the lambda can be called with just one argument. See more on positional vs. keyword arguments.
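A quick sketch of the difference (hypothetical function names, purely for illustration):

>>> from itertools import count
>>> def f(k, line):          # `line` is a required positional parameter
...     return next(line)
>>> f('ignored')
Traceback (most recent call last):
  ...
TypeError: f() missing 1 required positional argument: 'line'
>>> def g(k, line=count()):  # `line` now has a default value
...     return next(line)
>>> g('ignored')
0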
You can still define line=count() outside the function by assigning the result to a keyword argument:
>>> chunk = 3
>>> line = count()
>>> keyfunc = lambda k, line=line: next(line) // chunk  # bind the outer `line` as the default
>>> [keyfunc("") for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> [keyfunc("") for _ in range(10)]
[3, 3, 4, 4, 4, 5, 5, 5, 6, 6] # note `count()` continues
... calling next on count() directly within the lambda expression, resulted in all the lines in the input file getting bunched together i.e a single key was generated by the groupby function ...
Try the following experiment with count():
>>> numbers = count()
>>> next(numbers)
0
>>> next(numbers)
1
>>> next(numbers)
2
As expected, next() yields the next item from the count() iterator (a for loop calls next() the same way under the hood). Note that iterators do not reset: next() simply gives the next item in line, as seen in the example above.
@Martijn Pieters pointed out that next(line) // chunk computes a floored integer that groupby() uses as the key for each line (bunching consecutive lines with the same key together), which is also expected. In your last experiment, however, count() is called inside the lambda, so a brand-new counter is created on every call; next(count()) is therefore always 0, every line gets the same key, and groupby() produces a single group. See the references for more on how groupby() works.
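A minimal REPL sketch of that pitfall:

>>> from itertools import count
>>> keyfunc = lambda k: next(count()) // 3  # a new count() is created on every call
>>> [keyfunc('ignored') for _ in range(6)]
[0, 0, 0, 0, 0, 0]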
References
Docs for itertools.count()
Docs for itertools.groupby()
Beazley, D. and Jones, B. "7.7 Capturing Variables in Anonymous Functions," Python Cookbook, 3rd ed. O'Reilly, 2013.
Related
I am trying to create a generator function that loops over an iterable sequence while eliminating duplicates and then returns each result in order one at a time (not as a set or list), but I am having difficulty getting it to work. I have found similar questions here, but the responses pretty uniformly result in a list being produced.
I would like the output to be something like:
>>> next(i)
2
>>> next(i)
8
>>> next(i)
4....
I was able to write it as a regular function that produces a list:
def unique(series):
    new_series = []
    for i in series:
        if i not in new_series:
            new_series.append(i)
    return new_series

series = [2,8,4,5,5,6,6,6,2,1]
print(unique(series))
I then tried rewriting it as a generator function by eliminating the lines that create a blank list and that append to that list, and then using "yield" instead of "return"; but I’m not getting it to work:
def unique(series):
    for i in series:
        if i not in new_series:
            yield new_series
I don't know if I'm leaving something out or putting too much in. Thank you for any assistance.
Well, to put it simply, you need something to "remember" the values you find. In your first function you were using the new list itself, but in the second one you don't have it, so it fails. You can use a set() for this purpose.
def unique(series):
    seen = set()
    for i in series:
        if i not in seen:
            seen.add(i)
            yield i
Also, yield should "yield" a single value at once, not the entire new list.
To print out the elements, you'll have to iterate on the generator. Simply doing print(unique([1, 2, 3])) will print the resulting generator object.
>>> print(unique([1, 1, 2, 3]))
<generator object unique at 0x1023bda98>
>>> print(*unique([1, 1, 2, 3]))
1 2 3
>>> for x in unique([1, 1, 2, 3]):
...     print(x)
...
1
2
3
Note: * in the second example is the iterable unpacking operator.
Try this:
def unique(series):
    new_se = []
    for i in series:
        if i not in new_se:
            new_se.append(i)
    new_se = list(dict.fromkeys(new_se))  # redundant here: the loop above already removed duplicates
    return new_se

series = [2,8,4,5,5,6,6,6,2,1]
print(unique(series))
I was trying to use itertools.groupby to help me group a list of integers by positive or negative property, for example:
input:
[1,2,3, -1,-2,-3, 1,2,3, -1,-2,-3]
will return:
[[1,2,3],[-1,-2,-3],[1,2,3],[-1,-2,-3]]
However if I:
import itertools

nums = [1,2,3, -1,-2,-3, 1,2,3, -1,-2,-3]
group_list = list(itertools.groupby(nums, key=lambda x: x>=0))
print(group_list)
for k, v in group_list:
    print(list(v))
>>>
[]
[-3]
[]
[]
But if I don't list() the groupby object, it will work fine:
nums = [1,2,3, -1,-2,-3, 1,2,3, -1,-2,-3]
group_list = itertools.groupby(nums, key=lambda x: x>=0)
for k, v in group_list:
    print(list(v))
>>>
[1, 2, 3]
[-1, -2, -3]
[1, 2, 3]
[-1, -2, -3]
What I don't understand is: a groupby object is an iterator that yields pairs of a key and a _grouper object, so shouldn't a call to list() on the groupby object leave the _grouper objects unconsumed?
And even if it did consume, how did I get [-3] from the second element?
Per the docs, it is explicitly noted that advancing the groupby object renders the previous group unusable (in practice, empty):
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.
Basically, instead of list-ifying directly with the list constructor, you'd need a listcomp that converts from group iterators to lists before advancing the groupby object, replacing:
group_list = list(itertools.groupby(nums, key=lambda x: x>=0))
with:
group_list = [(k, list(g)) for k, g in itertools.groupby(nums, key=lambda x: x>=0)]
The design of most itertools module types is intended to avoid storing data implicitly, because they're meant to be used with potentially huge inputs. If every grouper stored copies of all the data from the input (and the groupby object had to retroactively populate them), it would get ugly and could blow memory by accident. By forcing you to store values explicitly, you don't accumulate unbounded amounts of data unintentionally, per the Zen of Python:
Explicit is better than implicit.
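Applied to the numbers from the question, the explicit version produces the expected groups (a quick illustrative check):

>>> import itertools
>>> nums = [1,2,3, -1,-2,-3, 1,2,3, -1,-2,-3]
>>> group_list = [(k, list(g)) for k, g in itertools.groupby(nums, key=lambda x: x>=0)]
>>> [g for k, g in group_list]
[[1, 2, 3], [-1, -2, -3], [1, 2, 3], [-1, -2, -3]]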
I have the following program :
import string
import itertools
import multiprocessing as mp

def test(word_list):
    return list(map(lambda xy: (xy[0], len(list(xy[1]))),
                    itertools.groupby(sorted(word_list))))

def f(x):
    return (x[0], len(list(x[1])))

def test_parallel(word_list):
    w = mp.cpu_count()
    pool = mp.Pool(w)
    return (pool.map(f, itertools.groupby(sorted(word_list))))

def main():
    test_list = ["test", "test", "test", "this", "this", "that"]
    print(test(test_list))
    print(test_parallel(test_list))
    return

if __name__ == "__main__":
    main()
The output is :
[('test', 3), ('that', 1), ('this', 2)]
[('test', 0), ('that', 0), ('this', 1)]
The first line is the expected and correct result. My question is, why isn't pool.map() returning the same results as map()?
Also, I'm aware a 6 item list isn't the perfect case for multiprocessing. This is simply a demonstration of the issue I am having while implementing in a larger application.
I'm using Python 3.5.1.
From https://docs.python.org/3.5/library/itertools.html#itertools.groupby:
The returned group is itself an iterator that shares the underlying
iterable with groupby(). Because the source is shared, when the
groupby() object is advanced, the previous group is no longer visible.
So, if that data is needed later, it should be stored as a list:
groups = []
uniquekeys = []
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    groups.append(list(g))    # Store group iterator as a list
    uniquekeys.append(k)
I think the issue here is that Pool.map tries to chop up its input, and in doing so, it iterates through the result of groupby, which effectively skips over the elements from all but the last group.
One fix for your code would be to use something like [(k, list(v)) for k, v in itertools.groupby(sorted(word_list))], but I don't know how applicable that is to your real-world use case.
groupby() returns iterators per group, and these are not independent from the underlying iterator passed in. You can't independently iterate over these groups in parallel; any preceding group will be prematurely ended the moment you access the next.
pool.map() will try to read all of the groupby() iterator results to send those results to separate functions; merely trying to get a second group will cause the first to be empty.
You can see the same result without pool.map() simply by iterating to the next result from groupby():
>>> from itertools import groupby
>>> word_list = ["test", "test", "test", "this", "this", "that"]
>>> iterator = groupby(sorted(word_list))
>>> first = next(iterator)
>>> next(first[1])
'test'
>>> second = next(iterator)
>>> list(first[1])
[]
The remainder of the first group is 'empty' because the second group has been requested.
This is clearly documented:
Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible.
You'd have to 'materialise' each group before sending it to the function:
return pool.map(lambda kg: f((kg[0], list(kg[1]))), itertools.groupby(sorted(word_list)))
or
return pool.map(f, (
    (key, list(group)) for key, group in itertools.groupby(sorted(word_list))))
where the generator expression takes care of the materialising as pool.map() iterates.
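Putting that together, a minimal corrected sketch of test_parallel (note that the lambda form above generally won't work with multiprocessing's default start methods, since lambdas can't be pickled, so the generator-expression form is the portable one):

import itertools
import multiprocessing as mp

def f(x):
    return (x[0], len(list(x[1])))

def test_parallel(word_list):
    # materialise each group before pool.map() distributes the work
    with mp.Pool(mp.cpu_count()) as pool:
        return pool.map(f, ((key, list(group))
                            for key, group in itertools.groupby(sorted(word_list))))

if __name__ == "__main__":
    print(test_parallel(["test", "test", "test", "this", "this", "that"]))
    # [('test', 3), ('that', 1), ('this', 2)]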
I would like to ask if/how could I rewrite those lines below, to run faster.
*(-10000, 10000) is just a range that I can be sure my numbers fall within.
first = 10000
last = -10000
for key in my_data.keys():
    if "LastFirst_" in key:  # In my_data there are many more keys with lots of vals.
        first = min(first, min(my_data[key]))
        last = max(last, max(my_data[key]))
print first, last
Also, is there a more Pythonic way to write it (even if it wouldn't run faster)?
Thx
Use the * operator to unpack the values:
>>> my_data = {'LastFirst_1':[1, 4, 5], 'LastFirst_2':[2, 4, 6]}
>>> d = [item for k,v in my_data.items() if 'LastFirst_' in k for item in v]
>>> first = 2
>>> last = 5
>>> min(first, *d)
1
>>> max(last, *d)
6
You could use some comprehensions to simplify the code.
first = min(min(data) for (key, data) in my_data.items() if "LastFirst_" in key)
last = max(max(data) for (key, data) in my_data.items() if "LastFirst_" in key)
The min and max functions are overloaded to take either multiple values (as you use it), or one sequence of values, so you can pass in iterables (e.g. lists) and get the min or max of them.
Also, if you're only interested in the values, use .values() or itervalues(). If you're interested in both, use .items() or .iteritems(). (In Python 3, there is no .iter- version.)
If you have many sequences, you can use itertools.chain to make them one long iterable. You can also manually string them along using multiple for in a single comprehension, but that can be distasteful.
import itertools

def flatten1(iterables):
    # The "list" is necessary because we want to use this twice,
    # but `chain` returns an iterator, which can only be used once.
    return list(itertools.chain(*iterables))

# Note: the "(" ")" indicate that this is a generator expression, not a list.
valid_lists = (v for k, v in my_data.iteritems() if "LastFirst_" in k)
valid_values = flatten1(valid_lists)
# Alternative: [w for k, v in my_data.iteritems() if "LastFirst_" in k for w in v]
first = min(valid_values)
last = max(valid_values)
print first, last
If it's possible that no matching values exist in the dict, the coder should decide what to do, but I would suggest allowing the default behavior of max/min (a raised ValueError) rather than trying to guess the upper or lower bound. Either would be more Pythonic.
In Python 3, you may specify a default argument, e.g. max(valid_values, default=10000).
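For instance, a quick illustration of both calling conventions and the Python 3 default= keyword:

>>> min(1, 2, 3)        # multiple values
1
>>> min([1, 2, 3])      # one iterable
1
>>> min([], default=0)  # Python 3 only; an empty iterable would otherwise raise ValueError
0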
my_data = {'LastFirst_a': [1, 2, 34000], 'LastFirst_b': [-12000, 1, 5]}
first = 10000
last = -10000
# Note: replace .items() with .iteritems() if you're using Python 2.
relevant_data = [el for k, v in my_data.items() if "LastFirst_" in k for el in v]
# maybe faster:
# relevant_data = [el for k, v in my_data.items() if k.startswith("LastFirst_") for el in v]
first = min(first, min(relevant_data))
last = max(last, max(relevant_data))
print(first, last)
values = [my_data[k] for k in my_data if 'LastFirst_' in k]
flattened = [item for sublist in values for item in sublist]
min(first, min(flattened))
max(last, max(flattened))
or
values = [item for sublist in (v for k, v in my_data.iteritems() if 'LastFirst_' in k) for item in sublist]
min(first, min(values))
max(last, max(values))
I was running some benchmarks and it seems that the second solution is slightly faster than the first.
However, I also compared these two versions with the code posted by other posters.
solution one: 0.648876905441
solution two: 0.634277105331
solution three (TigerhawkT3): 2.14495801926
solution four (Def_Os): 1.07884407043
solution five (leewangzhong): 0.635314941406
based on a randomly generated dictionary of 1 million keys.
I think that leewangzhong's solution is really good. Besides the timing shown above, in subsequent experiments it comes out slightly faster than my second solution (we are talking about milliseconds, though):
solution one: 0.678879022598
solution two: 0.62641787529
solution three: 2.15943193436
solution four: 1.05863213539
solution five: 0.611482858658
Itertools is really a great module!
I read the documentation on next() and I understand it abstractly. From what I understand, next() takes an iterator and returns its next item, which is what a for loop does for you on each pass. Makes sense! My question is: how is this useful outside the context of the built-in for loop? When would someone ever need to call next() directly? Can someone provide a simple example? Thanks, mates!
As luck would have it, I wrote one yesterday:
def skip_letters(f, skip=" "):
    """Wrapper function to skip specified characters when encrypting."""
    def func(plain, *args, **kwargs):
        # The generator expression must be parenthesized here,
        # since it is not the sole argument to f().
        gen = f((p for p in plain if p not in skip), *args, **kwargs)
        for p in plain:
            if p in skip:
                yield p
            else:
                yield next(gen)
    return func
This uses next to get the return values from the generator function f, but interspersed with other values. This allows some values to be passed through the generator, but others to be yielded straight out.
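For instance, a hypothetical usage sketch (rot13 via the codecs module stands in for a real cipher; the names here are illustrative, not from the original answer):

import codecs

@skip_letters
def rot13(chars):
    # Toy "cipher": ROT13-encode each character from the filtered stream.
    for c in chars:
        yield codecs.encode(c, 'rot13')

print(''.join(rot13("hello world")))  # 'uryyb jbeyq' -- the space passes through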
There are many places where we can use next(), e.g.:
Drop the header while reading a file:
with open(filename) as f:
    next(f)  # drop the first line
    # now do something with the rest of the lines
Iterator-based implementation of zip(seq, seq[1:]) (from the pairwise recipe in the itertools docs):
from itertools import tee
it1, it2 = tee(seq)
next(it2)
zip(it1, it2)  # use itertools.izip in Python 2
Get the first item that satisfies a condition:
next(x for x in seq if x % 100)
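Note that next() also accepts a default to return instead of raising StopIteration when nothing matches, e.g.:
next((x for x in seq if x % 100), None)  # None if every item is divisible by 100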
Creating a dictionary using adjacent items as key-value:
>>> it = iter(['a', 1, 'b', 2, 'c', '3'])
>>> {k: next(it) for k in it}
{'a': 1, 'c': '3', 'b': 2}
next is useful in many different ways, even outside of a for-loop. For example, if you have an iterable of objects and you want the first that meets a condition, you can give it a generator expression like so:
>>> lst = [1, 2, 'a', 'b']
>>> # Get the first item in lst that is a string
>>> next(x for x in lst if isinstance(x, str))
'a'
>>> # Get the first item in lst that != 1
>>> lst = [1, 1, 1, 2, 1, 1, 3]
>>> next(x for x in lst if x != 1)
2
>>>