Python == with or vs. in list comparison

When checking for equality, is there any actual difference in speed or functionality between the following:
number = 'one'
if number == 'one' or number == 'two':
vs.
number = 'one'
if number in ['one', 'two']:

If the values are literal constants (as in this case), in is likely to run faster: the (extremely limited) peephole optimizer converts the list to a constant tuple that is loaded all at once, reducing the bytecode work to two cheap loads plus a single comparison/conditional jump, whereas chained ors involve two cheap loads and a comparison/conditional jump for each test.
For two values, it might not help as much, but as the number of values increases, the byte code savings over the alternative (especially if hits are uncommon, or evenly distributed across the options) can be meaningful.
The above applies specifically to the CPython reference interpreter; other interpreters may have lower per-bytecode costs that reduce or eliminate the differences in performance.
A general advantage comes in if number is a more complicated expression; my_expensive_function() in (...) will obviously outperform my_expensive_function() == A or my_expensive_function() == B, since the former only computes the value once.
That said, if the values in the tuple aren't constant literals, especially if hits will be common on the earlier values, in will usually be more expensive (because it must create the sequence for testing every time, even if it ends up only testing the first value).
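If you want to see this for yourself, dis makes the difference visible (a quick sketch; the exact opcodes vary between CPython versions):
import dis

dis.dis("number == 'one' or number == 'two'")
dis.dis("number in ['one', 'two']")
# On recent CPython builds the second form loads one constant tuple ('one', 'two')
# and performs a single membership test, while the first form performs a separate
# comparison and conditional jump for each value.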

Talking about functionality: the two are not strictly equivalent, because in tests identity before equality, so they can disagree in edge cases; see https://stackoverflow.com/a/41957167/747744
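One concrete case where they can disagree (the linked answer covers the details): a value that isn't even equal to itself still tests as a member, because in matches on identity first:
x = float('nan')

print(x == x or x == 0.0)  # False: NaN compares unequal to everything, itself included
print(x in [x, 0.0])       # True: `in` matches on identity (x is x) before trying ==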

Python consistent hash replacement

As noted by many, Python's hash is not consistent anymore (as of version 3.3), as a random PYTHONHASHSEED is now used by default (to address security concerns, as explained in this excellent answer).
However, I have noticed that the hashes of some objects are still consistent (as of Python 3.7, anyway): that includes int, float, tuple(x) and frozenset(x) (as long as x itself yields a consistent hash). For example:
assert hash(10) == 10
assert hash((10, 11)) == 3713074054246420356
assert hash(frozenset([(0, 1, 2), 3, (4, 5, 6)])) == -8046488914261726427
Is that always true and guaranteed? If so, is that expected to stay that way? Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
Why am I asking?
I have a system that relies on hashing to remember whether or not we have seen a given dict (in any order): {key: tuple(ints)}. In that system, the keys are a collection of filenames and the tuples a subset of os.stat_result, e.g. (size, mtime) associated with them. This system is used to make update/sync decisions based on detecting differences.
In my application, I have on the order of 100K such dicts, and each can represent several thousands of files and their state, so the compactness of the cache is important.
I can tolerate the small false positive rate (< 10^-19 for 64-bit hashes) coming from possible hash collisions (see also birthday paradox).
One compact representation is the following for each such dict "fsd":
def fsd_hash(fsd: dict):
    return hash(frozenset(fsd.items()))
It is very fast and yields a single int to represent an entire dict (with order-invariance). If anything in the fsd dict changes, with high probability the hash will be different.
Unfortunately, hash is only consistent within a single Python instance, rendering it useless for hosts to compare their respective hashes. Persisting the full cache ({location_name: fsd_hash}) to disk to be reloaded on restart is also useless.
I cannot expect the larger system that uses that module to have been invoked with PYTHONHASHSEED=0, and, to my knowledge, there is no way to change this once the Python instance has started.
Things I have tried
I may use hashlib.sha1 or similar to calculate consistent hashes. This is slower, and I can't directly use the frozenset trick: I have to iterate through the dict in a consistent order (e.g. by sorting on keys, which is slow) while updating the hasher (a sketch of this appears below). In my tests on real data, I see over a 50x slow-down.
I could try applying an order-invariant hashing algorithm on consistent hashes obtained for each item (also slow, as starting a fresh hasher for each item is time-consuming).
I can try transforming everything into ints or tuples of ints and then frozensets of such tuples. At the moment, it seems that all int, tuple(int) and frozenset(tuple(int)) yield consistent hashes, but: is that guaranteed, and if so, how long can I expect this to be the case?
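For reference, a minimal sketch of the sha1-based approach described in the first item above (the function name is hypothetical; it assumes string keys mapping to tuples of ints that fit in 64 bits):
import hashlib
import struct

def fsd_digest(fsd: dict) -> bytes:
    # Feed the items into a single hasher in a consistent (sorted-by-key) order.
    h = hashlib.sha1()
    for key in sorted(fsd):
        h.update(key.encode('utf-8'))
        for n in fsd[key]:
            h.update(struct.pack('<q', n))  # signed 64-bit little-endian
    return h.digest()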
Additional question: more generally, what would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes? I can implement a custom __hash__ (a consistent one) for the classes I own, but I cannot override str's hash for example. One thing I came up with is:
def const_hash(x):
    if isinstance(x, (int, float, bool)):
        pass
    elif isinstance(x, frozenset):
        x = frozenset([const_hash(v) for v in x])
    elif isinstance(x, str):
        x = tuple([ord(e) for e in x])
    elif isinstance(x, bytes):
        x = tuple(x)
    elif isinstance(x, dict):
        x = tuple([(const_hash(k), const_hash(v)) for k, v in x.items()])
    elif isinstance(x, (list, tuple)):
        x = tuple([const_hash(e) for e in x])
    else:
        try:
            return x.const_hash()
        except AttributeError:
            raise TypeError(f'no known const_hash implementation for {type(x)}')
    return hash(x)
Short answer to broad question: There are no explicit guarantees made about hashing stability aside from the overall guarantee that x == y requires hash(x) == hash(y). The implicit assumption is that x and y are both defined in the same run of the program (you can't evaluate x == y when one of them doesn't exist in that run, so no guarantees are needed about hashes across runs).
Longer answers to specific questions:
Is [your belief that int, float, tuple(x), frozenset(x) (for x with consistent hash) have consistent hashes across separate runs] always true and guaranteed?
It's true of numeric types, and the mechanism is officially documented, but it is only guaranteed for a particular interpreter and a particular build. sys.hash_info provides the various constants, and they'll be consistent on that interpreter, but on a different interpreter (CPython vs. PyPy, 64-bit build vs. 32-bit build, even 3.n vs. 3.n+1) they can differ (they are documented to differ between 64-bit and 32-bit CPython), so the hashes won't be portable across machines with different interpreters.
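For example, on CPython you can check the documented numeric scheme directly against sys.hash_info (a sketch; the modulus is 2**61 - 1 on 64-bit builds and 2**31 - 1 on 32-bit builds):
import sys

M = sys.hash_info.modulus
assert hash(M + 7) == 7           # int hashes reduce modulo M
assert hash(0.5) == (M + 1) // 2  # 0.5 hashes to the modular inverse of 2,
                                  # per the documented rational-number scheme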
No guarantees on algorithm are made for tuple and frozenset; I can't think of any reason they'd change it between runs (if the underlying types are seeded, tuple and frozenset benefit from it without needing any changes), but they can and do change the implementation between releases of CPython (e.g. in late 2018 they made a change to reduce the number of hash collisions in short tuples of ints and floats). So if you store off the hashes of tuples from, say, 3.7, and then compute hashes of the same tuples in 3.8+, they won't match (even though they'd match between runs on 3.7, or between runs on 3.8).
If so, is that expected to stay that way?
Expected to, yes. Guaranteed, no. I could easily see seeded hashes for ints (and by extension, for all numeric types to preserve the numeric hash/equality guarantees) for the same reason they seeded hashes for str/bytes, etc. The main hurdles would be:
It would almost certainly be slower than the current, very simple algorithm.
By documenting the numeric hashing algorithm explicitly, they'd need a long period of deprecation before they could change it.
It's not strictly necessary (if web apps need seeded hashes for DoS protection, they can always just convert ints to str before using them as keys).
Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
Beyond str and bytes, it applies to a number of other things that implement their own hashing in terms of the hash of str or bytes, often because they're already naturally convertible to raw bytes and are commonly used as keys in dicts populated by web-facing frontends. The ones I know of off-hand include the various classes of the datetime module (datetime, date, time, though this isn't actually documented in the module itself), and read-only memoryviews with byte-sized formats (which hash equivalently to hashing the result of the view's .tobytes() method).
What would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes?
The simplest/most composable solution would probably be to define your const_hash as a single-dispatch function, using it the same way you use hash itself. This avoids having one single function defined in a single place that must handle all types: you keep the default const_hash implementation (which just relies on hash for the things with known consistent hashes) in a central location, and provide additional definitions there for the built-in types you know aren't consistent (or which might contain inconsistent stuff), while still letting people extend the set of covered types seamlessly by importing your const_hash and decorating the implementation for their type with @const_hash.register. It's not significantly different in effect from your proposed const_hash, but it's a lot more manageable.
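A rough sketch of what that looks like with functools.singledispatch (assuming Python 3.7+ so register() can dispatch on the parameter annotation; the handlers mirror the const_hash above for the types shown):
from functools import singledispatch

@singledispatch
def const_hash(x):
    # Default: rely on hash() for the types with known-consistent hashes,
    # falling back to an object's own const_hash() if it provides one.
    if isinstance(x, (int, float, bool)):
        return hash(x)
    try:
        return x.const_hash()
    except AttributeError:
        raise TypeError(f'no known const_hash implementation for {type(x)}')

@const_hash.register
def _(x: str):
    return hash(tuple(ord(c) for c in x))

@const_hash.register
def _(x: bytes):
    return hash(tuple(x))

@const_hash.register
def _(x: frozenset):
    return hash(frozenset(const_hash(v) for v in x))

@const_hash.register
def _(x: tuple):
    return hash(tuple(const_hash(e) for e in x))  # register list the same way if needed

@const_hash.register
def _(x: dict):
    return hash(tuple((const_hash(k), const_hash(v)) for k, v in x.items()))
Other modules can then extend it for their own types by importing const_hash and adding @const_hash.register implementations, without touching this file.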

When CPython set `in` operator is O(n)?

I was reading about the time complexity of set operations in CPython and learned that the in operator for sets has the average time complexity of O(1) and worst case time complexity of O(n). I also learned that the worst case wouldn't occur in CPython unless the set's hash table's load factor is too high.
This made me wonder, when such a case would occur in the CPython implementation? Is there a simple demo code, which shows a set with clearly observable O(n) time complexity of the in operator?
Load factor is a red herring. In CPython sets (and dicts) automatically resize to keep the load factor under 2/3. There's nothing you can do in Python code to stop that.
O(N) behavior can occur when a great many elements have exactly the same hash code. Then they map to the same hash bucket, and set lookup degenerates to a slow form of linear search.
The easiest way to contrive such bad elements is to create a class with a horrible hash function. Like, e.g., and untested:
class C:
    def __init__(self, val):
        self.val = val
    def __eq__(a, b):
        return a.val == b.val
    def __hash__(self):
        return 3
Then hash(C(i)) == 3 regardless of the value of i.
To do the same with builtin types requires deep knowledge of their CPython implementation details. For example, here's a way to create an arbitrarily large number of distinct ints with the same hash code:
>>> import sys
>>> M = sys.hash_info.modulus
>>> set(hash(1 + i*M) for i in range(10000))
{1}
which shows that the ten thousand distinct ints created all have hash code 1.
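To actually observe the linear behaviour, here's a rough timing sketch reusing the class C from the answer above (the numbers are machine-dependent; what matters is the gap between the two):
import timeit

bad = {C(i) for i in range(2000)}   # every element hashes to 3, so they all collide
good = set(range(2000))             # int hashes are well distributed
probe = C(-1)                       # not equal to anything in either set

print(timeit.timeit(lambda: probe in bad, number=100))  # each miss scans ~2000 entries
print(timeit.timeit(lambda: -1 in good, number=100))    # each miss is O(1)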
You can view the set source here which can help: https://github.com/python/cpython/blob/723f71abf7ab0a7be394f9f7b2daa9ecdf6fb1eb/Objects/setobject.c#L429-L441
It's difficult to devise a specific example but the theory is fairly simple luckily :)
The set stores the keys using a hash of the value, as long as that hash is unique enough you'll end up with the O(1) performance as expected.
If for some weird reason all of your items have different data but the same hash, it collides and it will have to check all of them separately.
To illustrate, you can see the set as a dict like this:
import collections

your_set = collections.defaultdict(list)

def add(value):
    your_set[hash(value)].append(value)

def contains(value):
    # This is where your O(n) can occur: all values sharing the same hash()
    values = your_set.get(hash(value), [])
    for v in values:
        if v == value:
            return True
    return False
This is sometimes called the 'amortization' of a set or dictionary. It shows up now and then as an interview question. As @TimPeters says, resizing happens automagically at 2/3 capacity, so you'll only hit O(n) if you force the hash collisions yourself.
In computer science, amortized analysis is a method for analyzing a given algorithm's complexity, or how much of a resource, especially time or memory, it takes to execute. The motivation for amortized analysis is that looking at the worst-case run time per operation, rather than per algorithm, can be too pessimistic.
/* GROWTH_RATE. Growth rate upon hitting maximum load.
 * Currently set to used*3.
 * This means that dicts double in size when growing without deletions,
 * but have more head room when the number of deletions is on a par with the
 * number of insertions. See also bpo-17563 and bpo-33205.
 *
 * GROWTH_RATE was set to used*4 up to version 3.2.
 * GROWTH_RATE was set to used*2 in version 3.3.0
 * GROWTH_RATE was set to used*2 + capacity/2 in 3.4.0-3.6.0.
 */
#define GROWTH_RATE(d) ((d)->ma_used*3)
More to the efficiency point: why 2/3? The Wikipedia article on hash tables has a nice graph of average insertion time accompanying it:
https://upload.wikimedia.org/wikipedia/commons/1/1c/Hash_table_average_insertion_time.png
(For our purposes the linear-probing curve corresponds to the O(1)-to-O(n) behaviour; chaining is a more complicated hashing approach.) See https://en.wikipedia.org/wiki/Hash_table for the complete picture.
Say you have a set or dictionary which is stable and is at 2/3 - 1 of its underlying capacity. Do you really want sluggish performance forever? You may wish to force resizing it upwards.
"if the keys are always known in advance, you can store them in a set and build your dictionaries from the set using dict.fromkeys()." plus some other useful if dated observations: Improving performance of very large dictionary in Python
For a good read on dictresize() (dict was in Python before set):
https://github.com/python/cpython/blob/master/Objects/dictobject.c#L415

Why is `word == word[::-1]` to test for palindrome faster than a more algorithmic solution in Python?

I wrote a disaster of a question on Code Review asking why Python programmers normally test if a string is a palindrome by comparing the string to itself reversed, instead of a more algorithmic way with lower complexity, assuming that the normal way would be faster.
Here is the pythonic way:
def is_palindrome_pythonic(word):
    # The slice requires N operations, plus memory,
    # and the equality requires N operations in the worst case
    return word == word[::-1]
Here is my attempt at a more efficient way to accomplish this:
def is_palindrome_normal(word):
    # This requires N/2 operations in the worst case
    low = 0
    high = len(word) - 1
    while low < high:
        if word[low] != word[high]:
            return False
        low += 1
        high -= 1
    return True
I would expect the normal way would be faster than the pythonic way. See for example this great article
Timing it with timeit, however, brought exactly the opposite result:
setup = '''
def is_palindrome_pythonic(word):
    # ...
def is_palindrome_normal(word):
    # ...
# N here is 2000
first_half = ''.join(map(str, (i for i in range(1000))))
word = first_half + first_half[::-1]
'''
timeit.timeit('is_palindrome_pythonic(word)', setup=setup, number=1000)
# 0.0052
timeit.timeit('is_palindrome_normal(word)', setup=setup, number=1000)
# 0.4268
I then figured that my n was too small, so I changed the length of word from 2000 to 2,000,000. The pythonic way took about 16 seconds on average, whereas the normal way ran several minutes before I canceled it.
Incidentally, in the best case scenario, where the very first letter does not match the very last letter, the normal algorithm was much faster.
What explains the extreme difference between the speeds of the two algorithms?
Because the "Pythonic" way with slicing is implemented in C. The interpreter / VM doesn't need to execute more than approximately once. The bulk of the algorithm is spent in a tight loop of native code.
As much as I love Python, I have to say that if you want maximum speed you probably shouldn't be using Python. ;)
The rule of thumb in Python time optimization is to use operators or module functions that do the bulk of the work at C speed rather than equivalent code running at Python speed. Even if the two equivalent approaches are using algorithms with the same big-O complexity, the time scaling factor of (mostly) running directly on the CPU vs running on the Python virtual machine has a big impact.
This is even true of an algorithm that's mostly just integer arithmetic, since Python integers are immutable objects, so when you do arithmetic there's the overhead of allocating and initialising a new integer object and disposing of the old one. CPython tries to be frugal, and is pretty smart at managing memory (so every new object doesn't require a system call to allocate memory), and of course the CPython interpreter maintains a cache of integers from -5 to 256 (inclusive) so that arithmetic with small numbers isn't so bad. But it's certainly slower than doing arithmetic at C speed with machine integers.
You can see the difference even with a simple counting loop. On my admittedly ancient 32 bit machine running Python 3.6, using the Bash time command to do the timings,
m = 5000000
for i in range(m):
    i
is roughly twice as fast as
m = 5000000
i = 0
while i < m:
    i += 1
because range can do the arithmetic at C speed, even though it still has to create a new integer object on each iteration. If you replace the i line in the range version with pass the time is roughly halved.
With more complicated algorithms the time differences can be much more significant, eg string or list copying that happens at the C level can often be done with efficient CPU operators that are much faster than chugging along on the Python virtual machine with Python code.
I agree that this can take a while to get used to if you come from a language that gets compiled to native machine code. And I admit that even after over 10 years of using Python it still feels a little weird to me that when (for example) you need to do some bit-manipulation stuff, it can often be faster in Python to do it using string operations on a string composed of '0's and '1's than to do it using the traditional bitwise and arithmetic integer operators.
OTOH, I think it's useful to know the traditional algorithms as well as the Pythonic ones. It's rare that a programmer will work only in Python, so it's good to know how to do things in languages that don't work the way that Python does.

Loop until steady-state of a complex data structure in Python

I have a more-or-less complex data structure (a list of dictionaries of sets) on which I perform a bunch of operations in a loop until the data structure reaches a steady state, i.e. doesn't change anymore. The number of iterations it takes to perform the calculation varies wildly depending on the input.
I'd like to know if there's an established way for forming a halting condition in this case. The best I could come up with is pickling the data structure, storing its md5 and checking if it has changed from the previous iteration. Since this is more expensive than my operations I only do this every 20 iterations but still, it feels wrong.
Is there a nicer or cheaper way to check for deep equality so that I know when to halt?
Thanks!
Take a look at python-deep. It should do what you want, and if it's not fast enough you can modify it yourself.
It also very much depends on how expensive the compare operation is and how expensive one calculation iteration is. Say one calculation iteration takes c time, one test takes t time, and the chance of termination per iteration is p; then the optimal testing frequency is:
(t * p) / c
That is assuming c < t; if that's not true then you should obviously check every loop.
So, since you can dynamically track c and t and estimate p (with possible adaptations in the code if it suspects the calculation is about to end), you can set your test frequency to an optimal value.
I think your only choices are:
Have every update mark a "dirty flag" when it alters a value from its starting state.
Doing a whole structure analysis (like the pickle/md5 combination you suggested).
Just run a fixed number of iterations known to reach a steady state (possibly running too many times but not having the overhead of checking the termination condition).
Option 1 is analogous to what Python itself does with ref-counting. Option 2 is analogous to what Python does with its garbage collector. Option 3 is common in numerical analysis (i.e. run divide-and-average 20 times to compute a square root).
Checking for equality doesn't seem the right way to go to me. Provided that you have full control over the operations you perform, I would introduce a "modified" flag (a boolean variable) that is set to False at the beginning of each iteration. Whenever one of your operations modifies (part of) your data structure, it is set to True, and you repeat until modified remains False throughout a complete iteration.
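A minimal sketch of that idea (all names here are hypothetical; it assumes every mutation goes through your own helper methods):
class TrackedStructure:
    # Wraps the list-of-dicts-of-sets and records whether anything changed.
    def __init__(self, data):
        self.data = data
        self.modified = False

    def replace_set(self, index, key, new_values):
        # Only flip the flag when the operation actually changes something.
        if self.data[index].get(key) != new_values:
            self.data[index][key] = new_values
            self.modified = True

state = TrackedStructure(build_initial_structure())  # build_initial_structure() is hypothetical
while True:
    state.modified = False
    run_one_iteration(state)  # hypothetical; must mutate only through replace_set()
    if not state.modified:
        break  # a full iteration changed nothing: steady state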
I would trust the python equality operator to be reasonably efficient for comparing compositions of built-in objects.
I expect it would be faster than pickling+hashing, provided python tests for list equality something like this:
def __eq__(a, b):
    if type(a) == list and type(b) == list:
        if len(a) != len(b):
            return False
        for i in range(len(a)):
            if a[i] != b[i]:
                return False
        return True
    # testing for other types goes here
Since the function returns as soon as it finds two elements that don't match, in the average case it won't need to iterate through the whole thing. Compare to hashing, which does need to iterate through the whole data structure, even in the best case.
Here's how I would do it:
import copy

def perform_a_bunch_of_operations(data):
    # take care to not modify the original data, as we will be using it later
    my_shiny_new_data = copy.deepcopy(data)
    # do lots of math here...
    return my_shiny_new_data

data = get_initial_data()
while True:
    nextData = perform_a_bunch_of_operations(data)
    if data == nextData:  # steady state reached
        break
    data = nextData
This has the disadvantage of having to make a deep copy of your data each iteration, but it may still be faster than hashing - you can only know for sure by profiling your particular case.
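If you want to know which approach is cheaper for your data, a rough profiling sketch along these lines can settle it (the structure below is just a stand-in; substitute your real data):
import hashlib
import pickle
import timeit

# Stand-in for the real structure: a list of dictionaries of sets.
data = [{k: set(range(k, k + 5)) for k in range(100)} for _ in range(50)]
previous = [{k: set(range(k, k + 5)) for k in range(100)} for _ in range(50)]

def by_equality():
    return data == previous

def by_pickle_md5():
    return (hashlib.md5(pickle.dumps(data)).digest()
            == hashlib.md5(pickle.dumps(previous)).digest())

print(timeit.timeit(by_equality, number=100))
print(timeit.timeit(by_pickle_md5, number=100))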

Is python's "set" stable?

The question arose when answering to another SO question (there).
When I iterate several times over a python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale for changing the order? Is it deterministic, or random? Or implementation defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is if python set iteration order only depends on the algorithm used to implement sets, or also on the execution context?
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard-to-find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python) integers less than the machine word size (32 bit or 64 bit) hash to themselves, but text strings, byte strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime
objects are “salted” with an unpredictable random value. Although they
remain constant within an individual Python process, they are not
predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service
caused by carefully-chosen inputs that exploit the worst case
performance of a dict insertion, O(n^2) complexity. See
http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and
other mappings. Python has never made guarantees about this ordering
(and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle

seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)

while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))
print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent, so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long runs of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
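For example, a minimal sketch of such a class (just enough to show the idea; deriving from collections.abc.MutableSet would fill in the rest of the set API):
from collections import OrderedDict

class OrderedSet:
    # Keeps insertion order by storing the elements as OrderedDict keys.
    def __init__(self, iterable=()):
        self._items = OrderedDict.fromkeys(iterable)

    def add(self, item):
        self._items[item] = None

    def discard(self, item):
        self._items.pop(item, None)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)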
The answer is simply a NO.
Python's set iteration order is NOT stable.
I did a simple experiment to show this.
The code:
import random

random.seed(1)
x = []

class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)
print('====')
for j in set(x):
    print(j.a, j.b)
Run this twice and you will get this:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0, see details here, here and here.
Use OrderedDict instead.
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.
