I ran this:
import sys
diii = {'key1':1,'key2':2,'key3':1,'key4':2,'key5':1,'key6':2,'key7':1}
print sys.getsizeof(diii)
# output: 1048
diii = {'key1':1,'key2':2,'key3':1,'key4':2,'key5':1,'key6':2,'key7':1,'key8':2}
print sys.getsizeof(diii)
# output: 664
Before asking here, I restarted my python shell and tried it online too and got the same result.
I thought a dictionary with one more element would report either the same number of bytes or more than the one containing one fewer element.
Any idea what I am doing wrong?
Previous answers have already mentioned that you needn't worry, so I will dive into some more technical details. It's long, but please bear with me.
TLDR: this has to do with the arithmetic of resizing. Each resize allocates 2**i slots, where 2**i > requested_size and 2**i >= 8; on top of that, each insert resizes the underlying table further if more than 2/3 of the slots are filled, this time requesting 4 * used slots (rounded up to the next power of 2). This way, your first dictionary ends up with 32 slots allocated while the second one with as few as 16 (as it got a bigger initial size upfront).
Answer: As @snakecharmerb noted in the comments, this depends on the way the dictionary is created. For the sake of brevity, let me refer you to this excellent blog post, which explains the differences between the dict() constructor and the dict literal {} at both the Python bytecode and CPython implementation levels.
Let's start with the magic number of 8 keys. It turns out to be a constant, predefined in Python 2.7's dictobject.h header file: the minimum size of a Python dictionary:
/* PyDict_MINSIZE is the minimum size of a dictionary. This many slots are
* allocated directly in the dict object (in the ma_smalltable member).
* It must be a power of 2, and at least 4. 8 allows dicts with no more
* than 5 active entries to live in ma_smalltable (and so avoid an
* additional malloc); instrumentation suggested this suffices for the
* majority of dicts (consisting mostly of usually-small instance dicts and
* usually-small dicts created to pass keyword arguments).
*/
#define PyDict_MINSIZE 8
As such, it may differ between the specific Python implementations, but let's assume that we all use the same CPython version. However, the dict of size 8 is expected to neatly contain only 5 elements; don't worry about this, as this specific optimization is not as important for us as it seems.
Now, when you create the dictionary using the dict literal {}, CPython takes a shortcut (as compared to the explicit creation by calling the dict constructor). Simplifying a bit, the bytecode operation BUILD_MAP gets resolved into a call to the _PyDict_NewPresized function, which constructs a dictionary whose size is already known upfront:
/* Create a new dictionary pre-sized to hold an estimated number of elements.
   Underestimates are okay because the dictionary will resize as necessary.
   Overestimates just mean the dictionary will be more sparse than usual.
*/

PyObject *
_PyDict_NewPresized(Py_ssize_t minused)
{
    PyObject *op = PyDict_New();

    if (minused>5 && op != NULL && dictresize((PyDictObject *)op, minused) == -1) {
        Py_DECREF(op);
        return NULL;
    }
    return op;
}
This function calls the normal dict constructor (PyDict_New) and requests a resize of the newly created dict - but only if it is expected to hold more than 5 elements. This is due to an optimization which allows Python to speed up some things by holding the data in the pre-allocated "smalltable", without invoking expensive memory allocation and de-allocation functions.
Then, dictresize will determine the minimum size of the new dictionary. It also uses the magic number 8 as the starting point and iteratively multiplies it by 2 until it finds the smallest size larger than the requested one. For the first dictionary this is simply 8; however, for the second one (and for all dictionaries created by a dict literal with between 8 and 15 keys) it is 16.
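To make that arithmetic concrete, here is a tiny Python sketch of the rounding rule just described (a simplification for illustration, not CPython's actual dictresize code; presized_table is a made-up name):

PyDict_MINSIZE = 8  # the constant quoted above

def presized_table(minused):
    # Smallest power of two strictly greater than minused, starting from 8,
    # mimicking the size-finding loop described above.
    newsize = PyDict_MINSIZE
    while newsize <= minused:
        newsize <<= 1
    return newsize

print(presized_table(7))   # 8  -> the 7-key literal keeps the initial small table
print(presized_table(8))   # 16 -> the 8-key literal starts out twice as big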
Now, the dictresize function has a special case for the former, smaller new_size == 8, which exists to support the aforementioned optimization (using the "small table" to avoid extra memory allocation). However, because there is no need to resize the newly created dict (no elements have been removed so far, so the table is "clean"), nothing really happens.
By contrast, when new_size != 8, the usual procedure of reallocating the hash table follows. This ends with a new table being allocated to store the "big" dictionary. While this is intuitive (the bigger dict got a bigger table), it does not yet explain the observed behaviour - but please bear with me one more moment.
Once we have the pre-allocated dict, STORE_MAP opcodes tell the interpreter to insert consecutive key-value pairs. This is implemented with the dict_set_item_by_hash_or_entry function, which - importantly - resizes the dictionary after each increase in size (i.e. successful insertion) if more than 2/3 of the slots are already used up. The table will then grow by a factor of 4 (for large dicts, only by a factor of 2).
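If you want to see those opcodes for yourself, the dis module can disassemble a dict literal; the exact opcodes depend on your interpreter version, so treat this as a quick check rather than a guaranteed listing:

import dis

# On CPython 2.7 this should show BUILD_MAP followed by one STORE_MAP per
# key/value pair; newer 3.x releases build literal dicts with different opcodes.
dis.dis(compile("{'key1': 1, 'key2': 2, 'key3': 1}", '<demo>', 'eval'))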
So here is what happens when you create the dict with 7 elements:
# note 2/3 = 0.(6)
BUILD_MAP # initial_size = 8, filled = 0
STORE_MAP # 'key_1' ratio_filled = 1/8 = 0.125, not resizing
STORE_MAP # 'key_2' ratio_filled = 2/8 = 0.250, not resizing
STORE_MAP # 'key_3' ratio_filled = 3/8 = 0.375, not resizing
STORE_MAP # 'key_4' ratio_filled = 4/8 = 0.500, not resizing
STORE_MAP # 'key_5' ratio_filled = 5/8 = 0.625, not resizing
STORE_MAP # 'key_6' ratio_filled = 6/8 = 0.750, RESIZING! new_size = 8*4 = 32
STORE_MAP # 'key_7' ratio_filled = 7/32 = 0.21875
And you end up with a dict whose hash table has 32 slots in total.
However, when adding eight elements the initial size is twice as big (16), so we never resize because the condition ratio_filled > 2/3 is never satisfied!
And that's why you end up with a smaller table in the second case.
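Putting both rules together, here is a small simulation of the behaviour described above (a sketch of the 2.7 logic as explained in this answer, not the real C code; simulate_literal_dict is a made-up helper):

def simulate_literal_dict(n_keys):
    # BUILD_MAP: pre-size the table, but only when more than 5 entries are expected.
    size = 8
    if n_keys > 5:
        while size <= n_keys:
            size <<= 1
    # STORE_MAP: insert keys one by one; after an insert that leaves more than
    # 2/3 of the slots filled, request used*4 slots rounded up to a power of 2.
    used = 0
    for _ in range(n_keys):
        used += 1
        if used * 3 >= size * 2:
            target = used * 4
            size = 8
            while size <= target:
                size <<= 1
    return size

print(simulate_literal_dict(7))   # 32
print(simulate_literal_dict(8))   # 16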
sys.getsizeof returns the memory allocated to the underlying hash table implementation of those dictionaries, which has a somewhat non-obvious relationship with the actual size of the dictionary.
The CPython implementation of Python 2.7 quadruples the amount of memory allocated to a hash table each time it's filled up to 2/3 of its capacity, but shrinks it if it has over-allocated (i.e. a large contiguous block of memory was allocated but only a few slots were actually used).
It just so happens that dictionaries that have between 8 and 11 elements allocate just enough memory for CPython to consider them 'over-allocated', and get shrunk.
You're not doing anything wrong. The size of a dictionary doesn't exactly correspond to the number of elements, as dictionaries are over-allocated and dynamically resized once a certain percentage of their memory space is used. I'm not sure what makes the dict smaller in 2.7 (it doesn't shrink in 3) in your example, but you don't have to worry about it. Why are you using 2.7, and why do you want to know the exact memory usage of the dict (which, by the way, doesn't include the memory used by the objects contained in the dictionary, as the dictionary itself only holds pointers)?
Allocation of dict literals is handled here: dictobject.c#L685-L695.
Due to quirks of the implementation, size vs number of elements does not end up being monotonically increasing.
import sys

def getsizeof_dict_literal(n):
    pairs = ["{0}:{0}".format(i) for i in range(n)]
    dict_literal = "{%s}" % ", ".join(pairs)
    source = "sys.getsizeof({})".format(dict_literal)
    size = eval(source)
    return size
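Calling it for a range of sizes makes the non-monotonic growth visible; the exact byte counts depend on your Python version and platform, so take the numbers as indicative only:

for n in range(13):
    print(n, getsizeof_dict_literal(n))
# On CPython 2.7 the reported size typically jumps around 6 keys
# (the first resize) and drops again at 8 keys (the bigger pre-size).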
The odd growing-and-shrinking behaviour exhibited is not just a weird once-off accident, it's a regularly repeating occurrence. For the first few thousand results, the visualization looks like this:
In more recent versions of Python, the dict implementation is completely different and allocation details are more sane. See bpo28731 - _PyDict_NewPresized() creates too small dict, for an example of some recent changes. In Python 3.7.3, the visualization now looks like this with smaller dicts in general and a monotonic allocation:
You are actually not doing anything wrong. getsizeof doesn't measure the size of the elements inside the dictionary, only a rough estimate of the dictionary object itself. An alternative for this problem would be json.dumps() from the json library. Though it doesn't give you the actual size of the object, it is consistent with the changes you make to the object.
Here's an example
import sys
import json
diii = {'key1':1,'key2':2,'key3':1,'key4':2,'key5':1,'key6':2,'key7':1}
print sys.getsizeof(json.dumps(diii)) # <----
diii = {'key1':1,'key2':2,'key3':1,'key4':2,'key5':1,'key6':2,'key7':1,'key8':2}
print sys.getsizeof(json.dumps(diii)) # <----
json.dumps() converts the dictionary into a JSON string, so diii is then measured as a string.
Read more about python's json library here
I have a class which primarily contains three dicts:
from collections import defaultdict

class KB(object):
    def __init__(self):
        # key: str, value: list of str
        self.linear_patterns = defaultdict(list)
        # key: str, value: list of str
        self.nonlinear_patterns = defaultdict(list)
        # key: str, value: dict
        self.pattern_meta_info = {}
        ...
        self.__initialize()

    def __initialize(self):
        # the 3 dicts are populated
        ...
The sizes of the 3 dicts are below:
linear_patterns: 100,000
non_linear_patterns: 900,000
pattern_meta_info: 700,000
After the program is run and done, it takes about 15 seconds to release the memory. When I reduce the sizes of the dicts above by loading less data during initialization, the memory is released faster, so I conclude that these dict sizes are what make the memory release slow. The total program takes about 8G of memory. Also, after the dicts are built, all operations are lookups, no modifications.
Is there a way to use cython to optimize the 3 data structures above, especially in terms of memory usage? Is there a similar cython dictionary that can replace the python dicts?
It seems unlikely that a different dictionary or object type would change much. Destructor performance is dominated by the memory allocator. That will be roughly the same unless you switch to a different malloc implementation.
If this is only about object destruction at the end of your program, most languages (but not Python) would allow you to simply call exit while keeping the KB object alive. The OS will release the memory much quicker when the process terminates, so why bother with destruction at all? Unfortunately that doesn't work with Python's sys.exit(), since it merely raises an exception.
Everything else relies on changing the data structure or algorithm. Are your strings highly redundant? Maybe you can reuse string objects by interning them. Keep them in a shared set to use the same string in multiple places. A simple call to string = sys.intern(string) is enough. Unlike in earlier versions of Python, this will not keep the string object alive beyond its use so you don't run the risk of leaking memory in a long-running process.
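A hedged sketch of the interning idea (the data below is made up; the point is that equal strings collapse into one shared object):

import sys

def intern_all(strings):
    # Replace each string with the interned (shared) copy of the same text.
    return [sys.intern(s) for s in strings]

raw = ["pattern_%d" % (i % 1000) for i in range(100000)]   # many duplicate texts
shared = intern_all(raw)

# Equal texts now refer to one object, so only ~1000 distinct string
# objects need to stay alive instead of 100000.
print(shared[0] is shared[1000])   # True: same interned object reused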
You could also pool the strings in one large allocation. If access is relatively rare, you could change the class to use one large io.StringIO object for its contained strings and all dictionaries just deal with (offset, length) tuples into that buffer.
That still leaves many tuple and integer objects but those use specialized allocators that may be faster. Also, the length integer will come from the common pool of small integers and not allocate a new object.
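And a rough sketch of the pooled-storage idea from above (the StringPool class and its methods are made up for illustration):

import io

class StringPool:
    # Store many strings in one io.StringIO buffer and hand out
    # (offset, length) tuples instead of separate str objects.
    def __init__(self):
        self._buf = io.StringIO()
        self._end = 0

    def add(self, s):
        offset = self._end
        self._buf.seek(offset)
        self._buf.write(s)
        self._end = offset + len(s)
        return (offset, len(s))

    def get(self, ref):
        offset, length = ref
        self._buf.seek(offset)
        return self._buf.read(length)

pool = StringPool()
ref = pool.add("some long pattern text")
print(pool.get(ref))   # 'some long pattern text'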
A final thought: 8 GB of string data. Are you sure you don't want a small sqlite or dbm database? It could be a temporary file.
I was reading about the time complexity of set operations in CPython and learned that the in operator for sets has the average time complexity of O(1) and worst case time complexity of O(n). I also learned that the worst case wouldn't occur in CPython unless the set's hash table's load factor is too high.
This made me wonder, when such a case would occur in the CPython implementation? Is there a simple demo code, which shows a set with clearly observable O(n) time complexity of the in operator?
Load factor is a red herring. In CPython sets (and dicts) automatically resize to keep the load factor under 2/3. There's nothing you can do in Python code to stop that.
O(N) behavior can occur when a great many elements have exactly the same hash code. Then they map to the same hash bucket, and set lookup degenerates to a slow form of linear search.
The easiest way to contrive such bad elements is to create a class with a horrible hash function. Like, e.g., and untested:
class C:
    def __init__(self, val):
        self.val = val
    def __eq__(a, b):
        return a.val == b.val
    def __hash__(self):
        return 3
Then hash(C(i)) == 3 regardless of the value of i.
To do the same with builtin types requires deep knowledge of their CPython implementation details. For example, here's a way to create an arbitrarily large number of distinct ints with the same hash code:
>>> import sys
>>> M = sys.hash_info.modulus
>>> set(hash(1 + i*M) for i in range(10000))
{1}
which shows that the ten thousand distinct ints created all have hash code 1.
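A rough timing sketch built on that trick; absolute numbers will vary by machine, but the colliding set should be dramatically slower (note that even building it is quadratic, so it takes a moment):

import sys
import time

M = sys.hash_info.modulus
colliding = {1 + i * M for i in range(10000)}   # 10000 distinct ints, all hash to 1
normal = set(range(10000))                      # well-spread hashes

def probe(s, missing, repeats=100):
    # Time membership tests for an absent element, which forces the
    # full probe sequence to be walked each time.
    start = time.perf_counter()
    for _ in range(repeats):
        missing in s
    return time.perf_counter() - start

print("colliding:", probe(colliding, 1 + 10001 * M))
print("normal:   ", probe(normal, -1))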
You can view the set source here which can help: https://github.com/python/cpython/blob/723f71abf7ab0a7be394f9f7b2daa9ecdf6fb1eb/Objects/setobject.c#L429-L441
It's difficult to devise a specific example but the theory is fairly simple luckily :)
The set stores the keys using a hash of the value; as long as that hash is unique enough, you'll end up with the O(1) performance as expected.
If for some weird reason all of your items have different data but the same hash, it collides and it will have to check all of them separately.
To illustrate, you can see the set as a dict like this:
import collections

your_set = collections.defaultdict(list)

def add(value):
    your_set[hash(value)].append(value)

def contains(value):
    # This is where your O(n) can occur: all values with the same hash()
    values = your_set.get(hash(value), [])
    for v in values:
        if v == value:
            return True
    return False
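For instance, a quick check of this model; ordinary strings hash differently, so each lands in its own bucket here, and the O(n) case would only appear if many values shared one hash:

add("spam")
add("eggs")
print(contains("spam"))    # True
print(contains("bacon"))   # False
print(len(your_set))       # 2 -- one bucket per distinct hash value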
This is sometimes called the 'amortization' of a set or dictionary. It shows up now and then as an interview question. As @TimPeters says, resizing happens automagically at 2/3 capacity, so you'll only hit O(n) if you force the hash collisions yourself.
In computer science, amortized analysis is a method for analyzing a given algorithm's complexity, or how much of a resource, especially time or memory, it takes to execute. The motivation for amortized analysis is that looking at the worst-case run time per operation, rather than per algorithm, can be too pessimistic.
/* GROWTH_RATE. Growth rate upon hitting maximum load.
* Currently set to used*3.
* This means that dicts double in size when growing without deletions,
* but have more head room when the number of deletions is on a par with the
* number of insertions. See also bpo-17563 and bpo-33205.
*
* GROWTH_RATE was set to used*4 up to version 3.2.
* GROWTH_RATE was set to used*2 in version 3.3.0
* GROWTH_RATE was set to used*2 + capacity/2 in 3.4.0-3.6.0.
*/
#define GROWTH_RATE(d) ((d)->ma_used*3)
More to the efficiency point: why 2/3? The Wikipedia article on hash tables has a nice graph accompanying it, https://upload.wikimedia.org/wikipedia/commons/1/1c/Hash_table_average_insertion_time.png (for our purposes, the linear probing curve is the one that ranges from O(1) to O(n); chaining is a more complicated hashing approach). See https://en.wikipedia.org/wiki/Hash_table for the complete article.
Say you have a set or dictionary which is stable, and is at 2/3 - 1 of its underlying capacity. Do you really want sluggish performance forever? You may wish to force resizing it upwards.
"if the keys are always known in advance, you can store them in a set and build your dictionaries from the set using dict.fromkeys()." plus some other useful if dated observations. Improving performance of very large dictionary in Python
For a good read on dictresize(): (dict was in Python before set)
https://github.com/python/cpython/blob/master/Objects/dictobject.c#L415
Both these functions compute the same thing (the number of integers whose associated Collatz sequence has length no greater than n) in essentially the same way. The only difference is that the first one uses sets exclusively whereas the second uses both sets and lists.
The second one leaks memory (in IDLE with Python 3.2, at least), the first one does not, and I have no idea why. I have tried a few "tricks" (such as adding del statements) but nothing seems to help (which is not surprising, those tricks should be useless).
I would be grateful to anybody who could help me understand what goes on.
If you want to test the code, you should probably use a value of n in the 55 to 65 range, anything above 75 will almost certainly result in a (totally expected) memory error.
def disk(n):
    """Uses sets for explored, current and to_explore. Does not leak."""
    explored = set()
    current = {1}
    for i in range(n):
        to_explore = set()
        for x in current:
            if not (x-1) % 3 and ((x-1)//3) % 2 and not ((x-1)//3) in explored:
                to_explore.add((x-1)//3)
            if not 2*x in explored:
                to_explore.add(2*x)
        explored.update(current)
        current = to_explore
    return len(explored)
def disk_2(n):
    """Does exactly the same thing, but uses a set for explored and lists for
    current and to_explore.
    Leaks (like a sieve :))
    """
    explored = set()
    current = [1]
    for i in range(n):
        to_explore = []
        for x in current:
            if not (x-1) % 3 and ((x-1)//3) % 2 and not ((x-1)//3) in explored:
                to_explore.append((x-1)//3)
            if not 2*x in explored:
                to_explore.append(2*x)
        explored.update(current)
        current = to_explore
    return len(explored)
EDIT : This also happens when using the interactive mode of the interpreter (without IDLE), but not when running the script directly from a terminal (in that case, memory usage goes back to normal some time after the function has returned, or as soon as there is an explicit call to gc.collect()).
CPython allocates small objects (obmalloc.c, 3.2.3) out of 4 KiB pools that it manages in 256 KiB blocks called arenas. Each active pool has a fixed block size ranging from 8 bytes up to 256 bytes, in steps of 8. For example, a 14-byte object is allocated from the first available pool that has a 16-byte block size.
There's a potential problem if arenas are allocated on the heap instead of using mmap (this is tunable via mallopt's M_MMAP_THRESHOLD), in that the heap cannot shrink below the highest allocated arena, which will not be released so long as 1 block in 1 pool is allocated to an object (CPython doesn't float objects around in memory).
Given the above, the following version of your function should probably solve the problem. Replace the line return len(explored) with the following 3 lines:
result = len(explored)
del i, x, to_explore, current, explored
return result + 0
After deallocating the containers and all referenced objects (releasing arenas back to the system), this returns a new int with the expression result + 0. The heap cannot shrink as long as there's a reference to the first result object. In this case that gets automatically deallocated when the function returns.
If you're testing this interactively without the "plus 0" step, remember that the REPL (Read, Eval, Print, Loop) keeps a reference to the last result accessible via the pseudo-variable "_".
In Python 3.3 this shouldn't be an issue since the object allocator was modified to use anonymous mmap for arenas, where available. (The upper limit on the object allocator was also bumped to 512 bytes to accommodate 64-bit platforms, but that's inconsequential here.)
Regarding manual garbage collection, gc.collect() does a full collection of tracked container objects, but it also clears freelists of objects that are maintained by built-in types (e.g. frames, methods, floats). Python 3.3 added additional API functions to clear freelists used by lists (PyList_ClearFreeList), dicts (PyDict_ClearFreeList), and sets (PySet_ClearFreeList). If you'd prefer to keep the freelists intact, use gc.collect(1).
I doubt it leaks; I bet it is just that garbage collection hasn't kicked in yet, so memory use keeps growing. On every round of the outer loop, the previous current list becomes eligible for garbage collection, but will not actually be collected until some later point.
Furthermore, even if it is garbage collected, memory isn't normally released back to the OS, so you have to use whatever Python method is available to get the currently used heap size.
If you add an explicit garbage collection at the end of every outer loop iteration (see the sketch below), that may reduce memory use a bit, or not, depending on how exactly Python handles its heap and garbage collection otherwise.
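If you want to try that, here is a hedged sketch of where such a call would go, using the disk_2 function from the question; whether it actually helps depends on the allocator behaviour discussed above:

import gc

def disk_2_with_gc(n):
    # disk_2 from the question, with an explicit collection each round.
    explored = set()
    current = [1]
    for i in range(n):
        to_explore = []
        for x in current:
            if not (x-1) % 3 and ((x-1)//3) % 2 and not ((x-1)//3) in explored:
                to_explore.append((x-1)//3)
            if not 2*x in explored:
                to_explore.append(2*x)
        explored.update(current)
        current = to_explore
        gc.collect()   # force a collection at the end of every outer iteration
    return len(explored)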
You do not have a memory leak. Processes on Linux do not release memory to the OS until they exit. Accordingly, the stats you see in e.g. top will only ever go up.
You only have a memory leak if after running the same, or smaller size of job, Python grabs more memory from the OS, when it "should" have been able to reuse the memory it was using for objects which "should" have been garbage collected.
I'm kind of new (1 day) to Python so maybe my question is stupid. I've already looked here but I can't find my answer.
I need to modify the content of an array at a random offset with a random size.
I have a Python API to interface a DLL for a USB device, which I can't modify. There is a function just like this one:
def read_device(in_array):
    # The DLL accesses the USB device and reads it into in_array of type array('B').
    # in_array can be an int (the function will create an array of that size
    # and return it), or an array, or a tuple (array, int).
    ...
In MY code, I create an array of, let's say, 64 bytes and I want to read 16 bytes starting from the 32nd byte. In C, I'd give &my_64_array[31] to the read_device function.
In Python, if I give:
read_device(my_64_array[31:31+16])
it seems that in_array is a copy of the given subset, and therefore my_64_array is not modified.
What can I do? Do I have to split my_64_array and recombine it afterwards?
Seeing as you are not able to update or change the API code, the best method is to pass the function a small temporary array that you then assign into your existing 64-byte array after the function call.
That would be something like the following, not knowing the exact specifics of your API call:
the_64_array[31:31+16] = read_device(16)
It's precisely as you say: if you pass a slice into a function, what gets passed is a copy of that slice, not a view onto the original array.
Two possible methods to add it later (assuming read_device returns the relevant slice):
my_64_array = my_64_array[:32] + read_device(my_64_array[31:31+16]) + my_64_array[31+16:]
# equivalently, but around 33% faster for even small arrays (length 10), 3 times faster for (length 50)...
my_64_array[31:31+16] = read_device(my_64_array[31:31+16])
So I think you should be using the latter.
If it were a modifiable function (but it's not in this case!) you could change your function's arguments so that one of them is the entire array:
def read_device(the_64_array, start=31, end=47):
    # some code
    the_64_array[start:end] = ...  # modifies `in_array` in place
and call read_device(my_64_array) or read_device(my_64_array, 31, 31+16).
When reading a list subset you're calling that list's __getitem__ with a slice(x, y) argument. In your case these statements are equivalent:
my_64_array[31:31+16]
my_64_array.__getitem__(slice(31, 31+16))
This means that the __getitem__ function can be overridden in a subclass to obtain different behaviour.
You can also set the same subset using a[1:3] = [1,2,3] in which case it'd call a.__setitem__(slice(1, 3), [1,2,3])
So I'd suggest either of these:
pass the list (my_64_array) and a slice object to read_device instead of passing the result of __getitem__, after which you could read the necessary data and set the corresponding offsets. No subclassing. This is probably the best solution in terms of readability and ease of development.
subclassing list, overriding __getitem__ and __setitem__ to return instances of that subclass with a parent reference, and then changing all modifying or reading methods of a list to reference the parent list instead. This might be a little tricky if you're new to Python, but basically you'd exploit the fact that Python list behaviour is largely defined by the methods on the list instance. This is probably better in terms of performance as you can create references.
If read_device returns the resulting list, and that list is of equal size, you can do this: a[x:y] = read_device(a[x:y])
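To tie the pieces together, here is a hedged sketch of the read-then-write-back approach; it assumes, as the question describes, that read_device(n) returns a fresh array('B') of length n:

from array import array

def read_slice_into(buf, start, length):
    # Read `length` bytes from the device and write them into
    # buf[start:start+length] in place via slice assignment.
    chunk = read_device(length)
    buf[start:start + length] = chunk
    return buf

my_64_array = array('B', [0] * 64)
read_slice_into(my_64_array, 31, 16)   # bytes 31..46 now hold device data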
The question arose when answering another SO question (there).
When I iterate several times over a Python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale for changing the order? Is it deterministic, or random? Or implementation defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is if python set iteration order only depends on the algorithm used to implement sets, or also on the execution context?
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it inbetween the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call inbetween could produce very hard to find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python), integers that fit in the machine word size (32 bit or 64 bit) hash to themselves, but text strings, bytes strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime objects are "salted" with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle

seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)

while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
"And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?"
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))
print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent, so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long runs of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
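A minimal sketch of that idea (not a complete MutableSet implementation, just enough to show the approach):

from collections import OrderedDict

class OrderedSet:
    # Set-like container that remembers insertion order, backed by an
    # OrderedDict whose values are ignored.
    def __init__(self, iterable=()):
        self._data = OrderedDict.fromkeys(iterable)

    def add(self, item):
        self._data[item] = None

    def discard(self, item):
        self._data.pop(item, None)

    def __contains__(self, item):
        return item in self._data

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

s = OrderedSet(['c', 'a', 'b', 'a'])
s.add('d')
print(list(s))   # ['c', 'a', 'b', 'd'] -- iteration follows insertion order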
The answer is simply NO.
Python set iteration order is NOT stable.
I did a simple experiment to show this.
The code:
import random

random.seed(1)
x = []

class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)

print('====')

for j in set(x):
    print(j.a, j.b)
Run this twice and you will get this:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0, see details here, here and here.
Use OrderedDict instead.
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.