How is frozenset equality implemented in Python?

How is frozenset equality implemented in CPython? In particular, I want to know how the individual elements in the frozenset are compared with each other, and the total time complexity.
I took a look at "set and frozenset difference in implementation" and https://stackoverflow.com/a/51778365/5792721.
The answer in the second link covers the steps taken before individual elements are compared, but it is missing the actual comparison algorithm.

The implementation in question can be found here. The algorithm goes something like this, given that we want to check if two sets v and w are equal:
1. Check whether their sizes are equal. If they are not, return false.
This reads the size field of the PySetObject, so it's a constant-time operation.
2. Check whether both sets are frozensets whose hashes have already been computed. If so, and the hashes differ, return false.
Again, this reads the hash field of the PySetObject, which is a constant-time operation. This works because frozensets are immutable: their hash is computed once (lazily, the first time it is needed) and cached for the lifetime of the object.
3. Check whether v is a subset of w.
This is done by iterating through each element of v and checking whether it is present in w; since the sizes are already known to be equal, this is enough to establish equality.
The iteration must be performed over all of v, and is therefore at worst linear in its size (if a differing element is found only in the last position).
Membership testing for sets takes constant time in the average case and linear time in the worst case; combining the two gives time linear in the size of v in the average case, and time proportional to len(v) * len(w) in the worst case.
This is, in a sense, the average case of the worst case, since it assumes that the first two checks never settle the question. If your data only rarely contains equal-sized sets that also share a hash yet hold different elements, then in the average case the comparison will still run in constant time.
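As a rough pure-Python sketch of those three steps (an illustration only, not the actual C code; cached_hash is a hypothetical helper standing in for the PySetObject's stored hash field):

    def cached_hash(s):
        # Hypothetical stand-in for the C-level hash field. Note that the real
        # code only consults hashes that are already cached; calling hash()
        # here forces computation, which the C code deliberately avoids.
        return hash(s) if isinstance(s, frozenset) else None

    def sets_equal(v, w):
        if len(v) != len(w):                      # step 1: O(1) size check
            return False
        hv, hw = cached_hash(v), cached_hash(w)   # step 2: O(1) hash check
        if hv is not None and hw is not None and hv != hw:
            return False
        return all(x in w for x in v)             # step 3: subset check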

Both set and frozenset use the set_richcompare() function for equality tests; there's no difference here.
set_richcompare(), as the other answer mentions, compares sizes and then hashes. Then it checks whether one set is a subset of the other, which simply loops over every element of the first set and checks whether it is present in the second. This check is otherwise identical to checking whether an element is in a set: a hash-table lookup, with a linear probe over items that share a hash. A single membership check is O(1) amortized or O(len(s)) in the worst case, so the set_issubset() check has a complexity of O(len(s1)) amortized, or O(len(s1)*len(s2)) in the worst case.
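One practical consequence: once a frozenset's hash has been computed, an unequal comparison can bail out without touching any elements. A small illustrative benchmark (my own sketch; absolute numbers and set iteration order are implementation details and will vary):

    import timeit

    N = 100_000
    base = set(range(N))
    other = (base - {N - 1}) | {N}   # same size, differs in one element

    f1, f2 = frozenset(base), frozenset(other)
    hash(f1); hash(f2)               # force both hashes to be computed and cached

    # The cached hashes almost certainly differ, so == fails in O(1).
    print(timeit.timeit(lambda: f1 == f2, number=1000))

    # Plain sets carry no cached hash, so == falls through to the subset loop.
    print(timeit.timeit(lambda: base == other, number=1000))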

Related

Asymptotic complexity of converting list to set in Python

There's already a question regarding this, and the answer says that the asymptotic complexity is O(n). But I observed that if an unsorted list is converted into a set, the set can be printed out in sorted order, which suggests that at some point in the middle of these operations the list has been sorted. Then, as any comparison sort has a lower bound of Omega(n lg n), the asymptotic complexity of this operation should also be Omega(n lg n). So what exactly is the complexity of this operation?
A set in Python is an unordered collection so any order you see is by chance. As both dict and set are implemented as hash tables in CPython, insertion is average case O(1) and worst case O(N).
So list(set(...)) is always O(N) and set(list(...)) is average case O(N).
You can browse the source code for set here.
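To see why the "sorted" output is a coincidence rather than evidence of sorting, consider this illustration (my own example, not from the original answer): in CPython, small integers hash to themselves, so they often land in table slots in increasing order, while string hashes are randomized per process.

    print(set([3, 1, 2]))        # {1, 2, 3} -- looks sorted, but only because
                                 # hash(i) == i for small ints in CPython
    print(set(["b", "a", "c"]))  # iteration order varies from run to run
                                 # (string hashing is randomized)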

Time complexity for adding elements to list vs set in python

Why does adding elements to a set take longer than adding elements to a list in Python? I created a loop that iterates over 1,000,000 elements and added each one to both a list and a set. The list consistently takes around 10 seconds, while the set takes around 20 seconds.
Both operations have O(1) amortized time complexity.
Appending an element to a list has a lower constant factor because it does not need to hash the element first, nor does it need to check for and handle hash collisions.
In the case of adding x to a set, Python needs to compute hash(x) first, because storing elements by their hashes is what gives sets their fast O(1) membership checks (compared to O(n) membership checks for lists).
The time complexity of appending to a list is the same as that of adding to a set: both are O(1) amortized. On average each operation takes a constant amount of time, although occasionally an operation takes longer in order to resize the underlying array in which the data is stored.
However, just because both are O(1) doesn't mean they take the same amount of time:
Appending to a list should be faster, because a list is implemented as a dynamic array, so appending an element just requires writing that element at the right index (which is already known), and increasing the length by 1.
In contrast, adding to a set is slower, because it requires computing the hash of the element to find the index to start looking from, then testing slots in some sequence to see whether the element is already there (by checking whether a slot is occupied and, if so, whether its element equals the one being inserted), until it either finds the element or finds an empty slot where the element should be added.
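A minimal benchmark along the lines of the question (my own sketch; absolute times depend on your machine and Python version, but the list should come out noticeably faster):

    import timeit

    N = 1_000_000

    def build_list():
        lst = []
        for i in range(N):
            lst.append(i)   # no hashing, just a write at the end

    def build_set():
        s = set()
        for i in range(N):
            s.add(i)        # hash(i) plus a table probe on every add

    print("list:", timeit.timeit(build_list, number=1))
    print("set: ", timeit.timeit(build_set, number=1))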

Time complexity of python "set.intersection" for n sets

I want to know the complexity of Python's set.intersection. I looked in the documentation and the online wikis for Python, but I did not find the time complexity of this method for multiple sets.
The Python wiki on time complexity lists a single intersection s & t as O(min(len(s), len(t))), where s and t are sets. (In English: the time is bounded by, and linear in, the size of the smaller set.)
Note: based on the comments below, this wiki entry was wrong when the argument passed was not a set. I've corrected the wiki entry.
If you have n sets (sets, not iterables), you'll do n-1 pairwise intersections, and the total time can be (n-1) * O(len(s)), where s is the set with the smallest size.
Note that the result may shrink as you go, so although the big-O bound is the worst case, in practice the time will often be better than this.
However, looking at the actual code, this idea of taking the min() only applies to a single pair of sets and doesn't extend to multiple sets. So in that case we have to be pessimistic and take s to be the set with the largest size.
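For illustration (my own example), intersecting several sets in one call, which CPython folds in pairwise from left to right:

    s1 = set(range(0, 1000, 2))   # multiples of 2
    s2 = set(range(0, 1000, 3))   # multiples of 3
    s3 = set(range(0, 1000, 5))   # multiples of 5

    common = s1.intersection(s2, s3)   # equivalent to (s1 & s2) & s3
    print(sorted(common)[:5])          # multiples of 30: [0, 30, 60, 90, 120]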

Why use hashes instead of test true equality?

I've recently been looking into Python's dictionaries (I believe they're called associative arrays in other languages) and was confused by a couple of the restrictions on their keys.
First, dict keys must be immutable. When I looked up the logic behind this, the answer was that dictionaries work like hash tables to look up the values for keys, and mutable keys (even if hashable at all) could change their hash value, causing problems when retrieving the value.
I understood why that was the case just fine, but I was still somewhat confused about what the point of using a hash table would be. If you simply didn't hash the keys and tested for true equality (assuming identically constructed objects compare equal), you could replicate most of the functionality of the dictionary without that limitation just by using two lists.
So, I suppose that's my real question - what's the rationale behind using hashes to look up values instead of equality?
If I had to guess, it's likely simply because comparing integers is incredibly fast and optimized, whereas comparing instances of other classes may not be.
You seem to be missing the whole point of a hash table, which is fast (O(1))[1] retrieval, and which cannot be implemented without hashing, i.e. the transformation of the key into some kind of well-distributed integer that is used as an index into a table. Notice that equality is still needed on retrieval to handle hash collisions[2], but you only resort to it once you have already narrowed the candidates down to those having a certain hash.
With just equality you could replicate similar functionality with parallel arrays or something similar, but that would make retrieval O(n)[3]; if you instead require a strict weak ordering, you can implement an RB tree, which allows O(log n) retrieval. But O(1) requires hashing.
Have a look at Wikipedia for some more insight on hash tables.
Notes
[1] It can degrade to O(n) in pathological scenarios (where all the elements land in the same bucket), but that's not supposed to happen with a good hash function.
[2] Since different elements may have the same hash, after reaching the bucket corresponding to the hash you must check whether you are actually retrieving the element associated with the given key.
[3] Or O(log n) if you keep your arrays sorted, but that complicates insertion, which becomes O(n) on average due to shifting elements around; but then again, if you have an ordering you probably want an RB tree or a heap.
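A toy version of the parallel-arrays idea mentioned above (my own sketch, just to make the O(n) cost concrete):

    # Replicating a dict with two parallel lists: only equality is needed,
    # but every operation involves a linear scan, i.e. O(n).
    keys, values = [], []

    def put(key, value):
        try:
            values[keys.index(key)] = value   # keys.index() scans the list: O(n)
        except ValueError:
            keys.append(key)
            values.append(value)

    def get(key):
        return values[keys.index(key)]        # O(n) again

    put("a", 1)
    put("b", 2)
    print(get("b"))   # 2 -- a real dict does this in O(1) on average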
By using hash tables you achieve O(1) retrieval, while comparing each value for equality would take O(n) (with a sequential search) or O(log n) (with a binary search).
Also note that O(1) is amortized time: if several keys hash to the same value, a sequential search is needed among those entries.

What makes sets faster than lists?

The python wiki says: "Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple."
I've been using sets in place of lists whenever speed is important in my code, but lately I've been wondering why sets are so much faster than lists. Could anyone explain, or point me to a source that would explain, what exactly is going on behind the scenes in python to make sets faster?
list: Imagine you are looking for your socks in your closet, but you don't know in which drawer your socks are, so you have to search drawer by drawer until you find them (or maybe you never do). That's what we call O(n), because in the worst scenario, you will look in all your drawers (where n is the number of drawers).
set: Now, imagine you're still looking for your socks in your closet, but now you know in which drawer your socks are, say in the 3rd drawer. So, you will just search in the 3rd drawer, instead of searching in all drawers. That's what we call O(1), because in the worst scenario you will look in just one drawer.
Sets are implemented using hash tables. Whenever you add an object to a set, its position within the set's internal table is determined from the hash of the object. When testing for membership, all that needs to be done is to check whether the object is at the position determined by its hash, so the speed of this operation does not depend on the size of the set. For lists, in contrast, the whole list may need to be searched, which becomes slower as the list grows.
This is also the reason that sets do not preserve the order of the objects you add.
Note that sets aren't faster than lists in general -- membership test is faster for sets, and so is removing an element. As long as you don't need these operations, lists are often faster.
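A quick way to see the difference (my own example; exact timings will vary, but the gap grows with N):

    import timeit

    N = 1_000_000
    setup = f"data = list(range({N})); s = set(data); target = {N} - 1"

    # Worst case for the list: the target is at the end, so the scan is O(n).
    print("list:", timeit.timeit("target in data", setup=setup, number=100))
    # The set computes hash(target) and probes a handful of slots: O(1) average.
    print("set: ", timeit.timeit("target in s", setup=setup, number=100))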
I think you need to take a good look at a book on data structures. Basically, Python lists are implemented as dynamic arrays and sets are implemented as hash tables.
The implementation of these data structures gives them radically different characteristics. For instance, a hash table has a very fast lookup time but cannot preserve the order of insertion.
While I have not measured anything performance-related in Python so far, I'd still like to point out that lists are often faster.
Yes, you have O(1) vs. O(n). But always remember that this only describes asymptotic behavior: for sufficiently large n, O(1) will always be faster - theoretically. In practice, however, n often needs to be much bigger than your usual data set will be.
So sets are not faster than lists per se, but only once you have to handle a lot of elements.
Python uses hash tables, which have O(1) lookup.
Basically, it depends on the operation you are doing (see the timing sketch after this list):
* For adding an element: a set doesn't need to move any data; all it needs to do is compute a hash value and add an entry to a table. For a list insertion there is potentially data to be moved.
* For deleting an element: all a set needs to do is remove the entry from the hash table, while a list potentially needs to move data around (on average half of the data).
* For a search (i.e. the in operator): a set just computes the hash value of the item, finds that hash value in the hash table, and if it is there - bingo. For a list, the search has to examine each item in turn - on average half of all the items in the list. Even for many thousands of items, a set will be far quicker to search.
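As promised, a rough timing sketch for the deletion case (my own example; the numbers are illustrative only):

    import timeit

    N = 100_000
    # Removing from the front of a list shifts every remaining element: O(n).
    print("list:", timeit.timeit("lst.pop(0)",
                                 setup=f"lst = list(range({N}))",
                                 number=10_000))
    # Discarding from a set just clears one hash-table entry: O(1) average.
    print("set: ", timeit.timeit("s.discard(i); i += 1",
                                 setup=f"s = set(range({N})); i = 0",
                                 number=10_000))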
Actually, sets are not faster than lists in every scenario. Generally, lists are faster than sets for tasks like plain iteration and appending. But when searching for an element in a collection, sets win because they are implemented with hash tables: Python does not have to scan the whole set, so the average time complexity is O(1). Lists use dynamic arrays, and Python needs to check the full array to search, which takes O(n).
So sets are better in some cases and lists are better in others. It's up to us to select the appropriate data structure for the task at hand.
A list must be searched one element at a time, whereas a set or dictionary uses hashes as an index for faster searching.
