I am wondering about the use of == when comparing two generators
For example:
x = ['1','2','3','4','5']
gen_1 = (int(ele) for ele in x)
gen_2 = (int(ele) for ele in x)
gen_1 and gen_2 are the same for all practical purposes, and yet when I compare them:
>>> gen_1 == gen_2
False
My guess here is that == falls back to behaving like is, and since gen_1 and gen_2 are located at different places in memory:
>>> gen_1
<generator object <genexpr> at 0x01E8BAA8>
>>> gen_2
<generator object <genexpr> at 0x01EEE4B8>
their comparison evaluates to False. Am I right on this guess? And any other insight is welcome.
And btw, I do know how to compare two generators:
>>> all(a == b for a,b in zip(gen_1, gen_2))
True
or even
>>> list(gen_1) == list(gen_2)
True
But if there is a better way, I'd love to know.
You are right with your guess – for types that don't define ==, the fallback is comparison based on object identity.
A better way to compare the values they generate would be
from itertools import zip_longest, tee
sentinel = object()
all(a == b for a, b in zip_longest(gen_1, gen_2, fillvalue=sentinel))
(For Python 2.x use izip_longest instead of zip_longest)
This can actually short-circuit without necessarily having to look at all values. As pointed out by larsmans in the comments, we can't use zip() here since it might give wrong results if the generators produce a different number of elements – zip() will stop on the shortest iterator. We use a newly created object instance as fill value for zip_longest(), since object instances compare unequal to any sane value that could appear in one of the generators (including other object instances).
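To see why the sentinel matters, here is a quick sketch with two generators of different lengths (the sample values are assumed, not from the question):

```python
from itertools import zip_longest

gen_1 = (int(x) for x in ['1', '2', '3'])
gen_2 = (int(x) for x in ['1', '2', '3', '4'])

# zip() silently stops at the shorter generator and reports equality:
print(all(a == b for a, b in zip(gen_1, gen_2)))  # True -- wrong!

gen_1 = (int(x) for x in ['1', '2', '3'])
gen_2 = (int(x) for x in ['1', '2', '3', '4'])

# zip_longest() pads the shorter side with the sentinel, exposing the mismatch:
sentinel = object()
print(all(a == b for a, b in zip_longest(gen_1, gen_2, fillvalue=sentinel)))  # False
```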
Note that there is no way to compare generators without changing their state. You could store the items that were consumed if you need them later on:
gen_1, gen_1_teed = tee(gen_1)
gen_2, gen_2_teed = tee(gen_2)
all(a == b for a, b in zip_longest(gen_1, gen_2, fillvalue=sentinel))
This will leave the state of gen_1 and gen_2 essentially unchanged. All values consumed by all() are stored inside the tee objects.
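Putting the tee() approach together as a complete runnable sketch (sample values assumed):

```python
from itertools import tee, zip_longest

gen_1 = (int(x) for x in ['1', '2', '3'])
gen_2 = (int(x) for x in ['1', '2', '3'])

# tee() splits each generator into two independent iterators
gen_1, gen_1_teed = tee(gen_1)
gen_2, gen_2_teed = tee(gen_2)

sentinel = object()
equal = all(a == b for a, b in zip_longest(gen_1, gen_2, fillvalue=sentinel))
print(equal)             # True
print(list(gen_1_teed))  # [1, 2, 3] -- the consumed values are still available here
```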
At that point, you might ask yourself if it is really worth it to use lazy generators for the application at hand -- it might be better to simply convert them to lists and work with the lists instead.
Because generators generate their values on-demand, there isn't any way to "compare" them without actually consuming them. And if your generators generate an infinite sequence of values, such an equality test as you propose would be useless.
== is indeed the same as is on two generators, because that's the only check that can be made without changing their state and thus losing elements.
list(gen_1) == list(gen_2)
is the reliable and general way of comparing two finite generators (but obviously consumes both); your zip-based solution fails when they do not generate an equal number of elements:
>>> list(zip([1,2,3,4], [1,2,3]))
[(1, 1), (2, 2), (3, 3)]
>>> all(a == b for a, b in zip([1,2,3,4], [1,2,3]))
True
The list-based solution still fails when either generator generates an infinite number of elements. You can devise a workaround for that, but when both generators are infinite, you can only devise a semi-algorithm for non-equality.
In order to do an item-wise comparison of two generators as with lists and other containers, Python would have to consume them both entirely (well, the shorter one, anyway). I think it's good that you must do this explicitly, especially since one or the other may be infinite.
Related
As zip yields as many values as the shortest iterable given, I would have expected passing zero arguments to zip to return an iterable yielding infinitely many tuples, instead of returning an empty iterable.
This would have been consistent with how other monoidal operations behave:
>>> sum([]) # sum
0
>>> math.prod([]) # product
1
>>> all([]) # logical conjunction
True
>>> any([]) # logical disjunction
False
>>> list(itertools.product()) # Cartesian product
[()]
For each of these operations, the value returned when given no arguments is the identity value for the operation, which is to say, one that does not modify the result when included in the operation:
sum(xs) == sum([*xs, 0]) == sum([*xs, sum()])
math.prod(xs) == math.prod([*xs, 1]) == math.prod([*xs, math.prod()])
all(xs) == all([*xs, True]) == all([*xs, all()])
any(xs) == any([*xs, False]) == any([*xs, any()])
Or at least, one that gives a trivially isomorphic result:
itertools.product(*xs, itertools.product())
    ≡ itertools.product(*xs, [()])
    ≡ ((*x, ()) for x in itertools.product(*xs))
In the case of zip, this would have been:
zip(*xs, zip()) ≡ (f(x) for x in zip(*xs))
Because zip returns an n-tuple when given n arguments, it follows that zip() with 0 arguments must yield 0-tuples, i.e. (). This forces f to return (*x, ()) and therefore zip() to be equivalent to itertools.repeat(()). Another, more general law is:
((*x, *y) for x, y in zip(zip(*xs), zip(*ys))) ≡ zip(*xs, *ys)
which would have then held for all xs and ys, including when either xs or ys is empty (and does hold for itertools.product).
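Both laws can be checked directly; the product identity holds, while the analogous zip law visibly fails when ys is empty (sample lists assumed):

```python
import itertools

xs = [[1, 2], [3, 4]]

# product() yields a single empty tuple, so it acts as an identity element:
lhs = list(itertools.product(*xs, itertools.product()))
rhs = [(*x, ()) for x in itertools.product(*xs)]
print(lhs == rhs)  # True

# The corresponding zip law breaks for empty ys, because zip() is empty:
ys = []
lhs = [(*x, *y) for x, y in zip(zip(*xs), zip(*ys))]
rhs = list(zip(*xs, *ys))
print(lhs)  # []
print(rhs)  # [(1, 3), (2, 4)]
```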
Yielding empty tuples indefinitely is also the behaviour that falls out of this straightforward reimplementation:
def my_zip(*iterables):
    iterables = tuple(map(iter, iterables))
    while True:
        item = []
        for it in iterables:
            try:
                item.append(next(it))
            except StopIteration:
                return
        yield tuple(item)
which means that the case of zip with no arguments must have been specifically special-cased not to do that.
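The difference is easy to observe side by side (my_zip is repeated here so the snippet stands alone):

```python
from itertools import islice

def my_zip(*iterables):
    iterables = tuple(map(iter, iterables))
    while True:
        item = []
        for it in iterables:
            try:
                item.append(next(it))
            except StopIteration:
                return
        yield tuple(item)

print(list(islice(my_zip(), 3)))  # [(), (), ()] -- empty tuples, indefinitely
print(list(zip()))                # [] -- the builtin special-cases zero arguments
```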
Why is zip() not equivalent to itertools.repeat(()) despite all the above?
PEP 201 and related discussion show that zip() with no arguments originally raised an exception. It was changed to return an empty list because this is more convenient for some cases of zip(*s) where s turns out to be an empty list. No consideration was given to what might be the 'identity', which in any case appears difficult to define with respect to zip - there is nothing you can zip with arbitrary x that will return x.
The original reasons why certain commutative and associative mathematical functions return the identity when applied to an empty list are not clear, but may have been driven by convenience, the principle of least astonishment, and the history of earlier languages like Perl or ABC. Explicit reference to the concept of mathematical identity is rarely if ever made (see e.g. Reason for "all" and "any" result on empty lists). So there is no reason to rely on functions in general to do this. In many cases it would be less surprising for them to raise an exception instead.
The following is a simplified example of my code.
>>> def action(num):
...     print "Number is", num
>>> items = [1, 3, 6]
>>> for i in [j for j in items if j > 4]:
...     action(i)
Number is 6
My question is the following: is it bad practice (for reasons such as code clarity) to simply replace the for loop with a comprehension which will still call the action function? That is:
>>> (action(j) for j in items if j > 2)
Number is 6
This shouldn't use a generator or comprehension at all.
def action(num):
    print "Number is", num

items = [1, 3, 6]

for j in items:
    if j > 4:
        action(j)
Generators evaluate lazily. The expression (action(j) for j in items if j > 2) will merely return a generator expression to the caller. Nothing will happen in it unless you explicitly exhaust it. List comprehensions evaluate eagerly, but, in this particular case, you are left with a list with no purpose. Just use a regular loop.
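A quick sketch of that laziness in action (using print() so it runs on Python 3):

```python
def action(num):
    print("Number is", num)

items = [1, 3, 6]

gen = (action(j) for j in items if j > 4)
# Nothing has been printed yet -- the generator expression is unevaluated.

list(gen)  # exhausting it triggers the calls and prints "Number is 6"
```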
This is bad practice. Firstly, your code fragment does not produce the desired output. You would instead get something like: <generator object <genexpr> at 0x03D826F0>.
Secondly, a list comprehension is for creating sequences, and generators are for creating streams of objects. Typically, they do not have side effects. Your action function is a prime example of a side effect -- it prints its input and returns nothing. Rather, a generator should, for each item it generates, take an input and compute some output, e.g.
doubled_odds = [x*2 for x in range(10) if x % 2 != 0]
By using a generator you are obfuscating the purpose of your code, which is to mutate global state (printing something), and not to create a stream of objects.
Whereas, just using a for loop makes the code slightly longer (basically just more whitespace), but immediately you can see that the purpose is to apply function to a selection of items (as opposed to creating a new stream/list of items).
for i in items:
    if i > 4:
        action(i)
Remember that generators are still looping constructs and that the underlying bytecode is more or less the same (if anything, generators are marginally less efficient), and you lose clarity. Generators and list comprehensions are great, but this is not the right situation for them.
While I personally favour Tigerhawk's solution, there might be a middle ground between his and willywonkadailyblah's solution (now deleted).
One of willywonkadailyblah's points was:
Why create a new list instead of just using the old one? You already have the condition to filter out the correct elements, so why put them away in memory and come back for them?
One way to avoid this problem is to use lazy evaluation of the filtering i.e. have the filtering done only when iterating using the for loop by making the filtering part of a generator expression rather than a list comprehension:
for i in (j for j in items if j > 4):
    action(i)
Output
Number is 6
In all honesty, I think Tigerhawk's solution is the best for this, though. This is just one possible alternative.
The reason that I proposed this is that it reminds me a lot of LINQ queries in C#, where you define a lazy way to extract, filter and project elements from a sequence in one statement (the LINQ expression) and can then use a separate for each loop with that query to perform some action on each element.
I want to make a condition where all selected variables are not equal.
My solution so far is to compare every pair which doesn't scale well:
if A!=B and A!=C and B!=C:
I want to do the same check for multiple variables, say five or more, and it gets quite confusing with that many. What can I do to make it simpler?
Create a set and check whether the number of elements in the set is the same as the number of variables in the list that you passed into it:
>>> variables = [a, b, c, d, e]
>>> if len(set(variables)) == len(variables):
...     print("All variables are different")
A set doesn't have duplicate elements so if you create a set and it has the same number of elements as the number of elements in the original list then you know all elements are different from each other.
If you can hash your variables (and, uh, your variables have a meaningful __hash__), use a set.
def check_all_unique(li):
    unique = set()
    for i in li:
        if i in unique: return False  # hey I've seen you before...
        unique.add(i)
    return True  # nope, saw no one twice.
O(n) on average. (And yes, I'm aware that you can also do len(li) == len(set(li)), but this variant returns early if a duplicate is found.)
If you can't hash your values (for whatever reason) but can meaningfully compare them:
def check_all_unique(li):
    li.sort()
    for i in range(1, len(li)):
        if li[i-1] == li[i]: return False
    return True
O(nlogn), because sorting. Basically, sort everything, and compare pairwise. If two things are equal, they should have sorted next to each other. (If, for some reason, your __cmp__ doesn't sort things that are the same next to each other, 1. wut and 2. please continue to the next method.)
And if ne is the only operator you have....
import operator
import itertools
li = #a list containing all the variables I must check
if all(operator.ne(*i) for i in itertools.combinations(li, 2)):
    #do something
I'm basically using itertools.combinations to pair off all the variables, and then using operator.ne to check for not-equalness. This has a worst-case time complexity of O(n^2), although it should still short-circuit (because generators, and all is lazy). If you are absolutely sure that ne and eq are opposites, you can use operator.eq and any instead.
Addendum: Vincent wrote a much more readable version of the itertools variant that looks like
import itertools
lst = #a list containing all the variables I must check
if all(a != b for a, b in itertools.combinations(lst, 2)):
    #do something
Addendum 2: Uh, for sufficiently large datasets, the sorting variant should possibly use heapq. Still would be O(nlogn) worst case, but O(n) best case. It'd be something like
import heapq

def check_all_unique(li):
    heapq.heapify(li)  # O(n), compared to sorting's O(nlogn)
    prev = heapq.heappop(li)
    for _ in range(len(li)):  # O(n) iterations
        current = heapq.heappop(li)  # O(logn)
        if current == prev: return False
        prev = current
    return True
Put the values into a container type. Then just loop through the container, comparing each value against every later one. It would take about O(n^2).
pseudo code:
a[0] = A; a[1] = B; ... a[n]
for i = 0 to n do
    for j = i + 1 to n do
        if a[i] == a[j]
            condition failed
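A direct Python translation of the pseudocode above, as a sketch (the function name all_distinct is my own):

```python
def all_distinct(values):
    """O(n^2) pairwise comparison -- needs only == on the values."""
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            if values[i] == values[j]:
                return False  # found a duplicate pair
    return True

print(all_distinct([1, 2, 3]))  # True
print(all_distinct([1, 2, 1]))  # False
```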
You can enumerate a list and check that all values are the first occurrence of that value in the list:
a = [5, 15, 20, 65, 48]
if all(a.index(v) == i for i, v in enumerate(a)):
    print "all elements are unique"
This allows for short-circuiting once the first duplicate is detected due to the behaviour of Python's all() function.
Or equivalently, enumerate a list and check if there are any values which are not the first occurrence of that value in the list:
a = [5, 15, 20, 65, 48]
if not any(a.index(v) != i for i, v in enumerate(a)):
    print "all elements are unique"
In python you can do list.pop(i) which removes and returns the element in index i, but is there a built in function like list.remove(e) where it removes and returns the first element equal to e?
Thanks
I mean, there is list.remove, yes.
>>> x = [1,2,3]
>>> x.remove(1)
>>> x
[2, 3]
I don't know why you need it to return the removed element, though. You've already passed it to list.remove, so you know what it is... I guess if you've overloaded __eq__ on the objects in the list so that it doesn't actually correspond to some reasonable notion of equality, you could have problems. But don't do that, because that would be terrible.
If you have done that terrible thing, it's not difficult to roll your own function that does this:
def remove_and_return(lst, item):
    return lst.pop(lst.index(item))
Is there a builtin? No. Probably because if you already know the element you want to remove, then why bother returning it?1
The best you can do is get the index, and then pop it. Ultimately, this isn't such a big deal -- Chaining 2 O(n) algorithms is still O(n), so you still scale roughly the same ...
def extract(lst, item):
    idx = lst.index(item)
    return lst.pop(idx)
1Sure, there are pathological cases where the item returned might not be the item you already know... but they aren't important enough to warrant a new method which takes only 3 lines to write yourself :-)
Strictly speaking, you would need something like:
def remove(lst, e):
    i = lst.index(e)  # raises ValueError if e not in lst
    a = lst[i]
    lst.pop(i)
    return a
Which would make sense only if e == a is true, but e is a is false, and you really need a instead of e.
In most cases, though, I would say that this suggests something suspicious in your code.
A short version would be:
a = lst.pop(lst.index(e))
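For example (sample list assumed):

```python
lst = [3, 1, 4, 1, 5]
removed = lst.pop(lst.index(1))  # removes and returns the first element == 1
print(removed)  # 1
print(lst)      # [3, 4, 1, 5]
```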
Is there an equivalent of startswith for lists in Python?
I would like to know if a list a starts with a list b, like:
len(a) >= len(b) and a[:len(b)] == b ?
You can just write a[:len(b)] == b
if len(b) > len(a), no error will be raised.
For large lists, this avoids building a temporary slice and can be more efficient:
from itertools import izip
...
result = len(a) >= len(b) and all(x == y for (x, y) in izip(a, b))
(the length check is required here, since izip stops at the shorter list and would otherwise report True when b is longer than a)
For small lists, your code is fine. The length check can be omitted, as DavidK said, but it would not make a big difference.
PS: No, there's no build-in to check if a list starts with another list, but as you already know, it's trivial to write such a function yourself.
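Wrapped up as a small helper (the name startswith_list is my own, written for Python 3):

```python
def startswith_list(a, b):
    # Slicing never raises, even when b is longer than a;
    # a too-short slice simply compares unequal to b.
    return a[:len(b)] == b

print(startswith_list([1, 2, 3], [1, 2]))  # True
print(startswith_list([1, 2], [1, 2, 3])) # False
```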
it does not get much simpler than what you have (and the check on the lengths is not even needed)...
for an overview of more extended/elegant options for finding sublists in lists, you can check out the main answer to this post: elegant find sub-list in list