How does Python 2.7 set work internally? [duplicate] - python

I've seen people say that set objects in python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure does it use? What other implications does that implementation have?
Every answer here was really enlightening, but I can only accept one, so I'll go with the closest answer to my original question. Thanks all for the info!

According to this thread:
Indeed, CPython's sets are implemented as something like dictionaries
with dummy values (the keys being the members of the set), with some
optimization(s) that exploit this lack of values
So basically a set uses a hashtable as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hashtable is an O(1) operation, on average.
If you are so inclined you can even browse the CPython source code for set which, according to Achim Domma, was originally mostly a cut-and-paste from the dict implementation.
Note: Nowadays, set and dict's implementations have diverged significantly, so the precise behaviors (e.g. arbitrary order vs. insertion order) and performance in various use cases differs; they're still implemented in terms of hashtables, so average case lookup and insertion remains O(1), but set is no longer just "dict, but with dummy/omitted keys".

When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.
The Wikipedia article says the best case time complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets since Python sets use a hash table that resizes.
A little further on the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/n can be bounded by a constant c<1.
Big-O refers only to asymptotic behavior as n → ∞.
Since k/n can be bounded by a constant, c<1, independent of n,
O(1/(1-k/n)) is no bigger than O(1/(1-c)) which is equivalent to O(constant) = O(1).
So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1).

I think its a common mistake, set lookup (or hashtable for that matter) are not O(1).
from the Wikipedia
In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.
Related: Is a Java hashmap really O(1)?

We all have easy access to the source, where the comment preceding set_lookkey() says:
/* set object implementation
Written and maintained by Raymond D. Hettinger <python#rcn.com>
Derived from Lib/sets.py and Objects/dictobject.c.
The basic lookup function used by all operations.
This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
The initial probe index is computed as hash mod the table size.
Subsequent probe indices are computed as explained in Objects/dictobject.c.
To improve cache locality, each probe inspects a series of consecutive
nearby entries before moving on to probes elsewhere in memory. This leaves
us with a hybrid of linear probing and open addressing. The linear probing
reduces the cost of hash collisions because consecutive memory accesses
tend to be much cheaper than scattered probes. After LINEAR_PROBES steps,
we then use open addressing with the upper bits from the hash value. This
helps break-up long chains of collisions.
All arithmetic on hash should ignore overflow.
Unlike the dictionary implementation, the lookkey function can return
NULL if the rich comparison returns an error.
*/
...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif
/* This must be >= 1 */
#define PERTURB_SHIFT 5
static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)
{
...

Sets in python employ hash table internally. Let us first talk about hash table.
Let there be some elements that you want to store in a hash table and you have 31 places in the hash table where you can do so. Let the elements be: 2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31. When you want to use a hash table, you first determine the indices in the hash table where these elements would be stored. Modulus function is a popular way of determining these indices, so let us say we take one element at a time, multiply it by 100 and apply modulo by 31. It is important that each such operation on an element results in a unique number as an entry in a hash table can store only one element unless chaining is allowed. In this way, each element would be stored at a location governed by the indices obtained through modulo operation. Now if you want to search for an element in a set which essentially stores elements using this hash table, you would obtain the element in O(1) time as the index of the element is computed using the modulo operation in a constant time.
To expound on the modulo operation, let me also write some code:
piles = [2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31]
def hash_function(x):
return int(x*100 % 31)
[hash_function(pile) for pile in piles]
Output: [4, 17, 8, 0, 16, 11, 10, 20, 21, 18]

To emphasize a little more the difference between set's and dict's, here is an excerpt from the setobject.c comment sections, which clarify's the main difference of set's against dicts.
Use cases for sets differ considerably from dictionaries where looked-up
keys are more likely to be present. In contrast, sets are primarily
about membership testing where the presence of an element is not known in
advance. Accordingly, the set implementation needs to optimize for both
the found and not-found case.
source on github

Related

Two number sum : why don't anybody do it this way

I was looking for the solution to "two number sum problem" and I saw every body using two for loops
and another way I saw was using a hash table
def twoSumHashing(num_arr, pair_sum):
sums = []
hashTable = {}
for i in range(len(num_arr)):
complement = pair_sum - num_arr[i]
if complement in hashTable:
print("Pair with sum", pair_sum,"is: (", num_arr[i],",",complement,")")
hashTable[num_arr[i]] = num_arr[i]
# Driver Code
num_arr = [4, 5, 1, 8]
pair_sum = 9
# Calling function
twoSumHashing(num_arr, pair_sum)
But why don't nobody discuss about this solution
def two_num_sum(array, target):
for num in array:
match = target - num
if match in array:
return [match, num]
return "no result found"
when using a hash table we have to store values into the hash table. But here there is no need for that.
1)Does that affect the time complexity of the solution?
2)looking up a value in hash table is easy compared to array, but if the values are huge in number,
does storing them in a hash table take more space?
First of all, the second function you provide as a solution is not correct and does not return a complete list of answers.
Second, as a Pythonist, it's better to say dictionary instead of the hash table. A python dictionary is one of the implementations of a hash table.
Anyhow, regarding the other questions that you asked:
Using two for-loops is a brute-force approach and usually is not an optimized approach in real. Dictionaries are way faster than lists in python. So for the sake of time-complexity sure, dictionaries are the winner.
From the point of view of space complexity, using dictionaries for sure takes more memory allocation, but with current hardware, it is not an essential issue for billions of numbers. It depends on your situation, whether the speed is crucial to you or the amount of memory.
first function
uses O(n) time complexity as you iterate over n members in the array
uses O(n) space complexity as you can have one pair which is the first and the last, then in the worst case you can store up to n-1 numbers.
second function
uses O(n^2) time complexity as you iterate first on the array then uses in which uses __contains__ on list which is O(n) in worst case.
So the second function is like doing two loops to brute force the solution.
Another thing to point out in second function is that you don't return all the pairs but just the first pair you find.
Then you can try and fix it by iterating from the index of num+1 but you will have duplicates.
This is all comes down to a preference of what's more important - time complexity or space complexity-
this is one of many interview / preparation to interview question where you need to explain why you would use function two (if was working properly) over function one and vice versa.
Answers for your questions
1.when using a hash table we have to store values into the hash table. But here there is no need for that. 1)Does that affect the time complexity of the solution?
Yes now time complexity is O(n^2) which is worse
2)looking up a value in hash table is easy compared to array, but if the values are huge in number, does storing them in a hash table take more space?
In computers numbers are just representation of bits. Larger numbers can take up more space as they need more bits to represent them but storing them will be the same, no matter where you store.

Why are dict lookups always better than list lookups?

I was using a dictionary as a lookup table but I started to wonder if a list would be better for my application -- the amount of entries in my lookup table wasn't that big. I know lists use C arrays under the hood which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).
I decided to profile the alternatives but the results surprised me. List lookup was only better with a single element! See the following figure (log-log plot):
So here comes the question: Why do list lookups perform so poorly? What am I missing?
On a side question, something else that called my attention was a little "discontinuity" in the dict lookup time after approximately 1000 entries. I plotted the dict lookup time alone to show it.
p.s.1 I know about O(n) vs O(1) amortized time for arrays and hash tables, but it is usually the case that for a small number of elements iterating over an array is better than to use a hash table.
p.s.2 Here is the code I used to compare the dict and list lookup times:
import timeit
lengths = [2 ** i for i in xrange(15)]
list_time = []
dict_time = []
for l in lengths:
list_time.append(timeit.timeit('%i in d' % (l/2), 'd=range(%i)' % l))
dict_time.append(timeit.timeit('%i in d' % (l/2),
'd=dict.fromkeys(range(%i))' % l))
print l, list_time[-1], dict_time[-1]
p.s.3 Using Python 2.7.13
I know lists use C arrays under the hood which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).
Accessing a few array elements is cheap, sure, but computing == is surprisingly heavyweight in Python. See that spike in your second graph? That's the cost of computing == for two ints right there.
Your list lookups need to compute == a lot more than your dict lookups do.
Meanwhile, computing hashes might be a pretty heavyweight operation for a lot of objects, but for all ints involved here, they just hash to themselves. (-1 would hash to -2, and large integers (technically longs) would hash to smaller integers, but that doesn't apply here.)
Dict lookup isn't really that bad in Python, especially when your keys are just a consecutive range of ints. All ints here hash to themselves, and Python uses a custom open addressing scheme instead of chaining, so all your keys end up nearly as contiguous in memory as if you'd used a list (which is to say, the pointers to the keys end up in a contiguous range of PyDictEntrys). The lookup procedure is fast, and in your test cases, it always hits the right key on the first probe.
Okay, back to the spike in graph 2. The spike in the lookup times at 1024 entries in the second graph is because for all smaller sizes, the integers you were looking for were all <= 256, so they all fell within the range of CPython's small integer cache. The reference implementation of Python keeps canonical integer objects for all integers from -5 to 256, inclusive. For these integers, Python was able to use a quick pointer comparison to avoid going through the (surprisingly heavyweight) process of computing ==. For larger integers, the argument to in was no longer the same object as the matching integer in the dict, and Python had to go through the whole == process.
The short answer is that lists use linear search and dicts use amortized O(1) search.
In addition, dict searches can skip an equality test either when 1) hash values don't match or 2) when there is an identity match. Lists only benefit from the identity-implies equality optimization.
Back in 2008, I gave a talk on this subject where you'll find all the details: https://www.youtube.com/watch?v=hYUsssClE94
Roughly the logic for searching lists is:
for element in s:
if element is target:
# fast check for identity implies equality
return True
if element == target:
# slower check for actual equality
return True
return False
For dicts the logic is roughly:
h = hash(target)
for i in probe_sequence(h, len(table)):
element = key_table[i]
if element is UNUSED:
raise KeyError(target)
if element is target:
# fast path for identity implies equality
return value_table[i]
if h != h_table[i]:
# unequal hashes implies unequal keys
continue
if element == target:
# slower check for actual equality
return value_table[i]
Dictionary hash tables are typically between one-third and two-thirds full, so they tend to have few collisions (few trips around the loop shown above) regardless of size. Also, the hash value check prevents needless slow equality checks (the chance of a wasted equality check is about 1 in 2**64).
If your timing focuses on integers, there are some other effects at play as well. That hash of a int is the int itself, so hashing is very fast. Also, it means that if you're storing consecutive integers, there tend to be no collisions at all.
You say "accessing a few elements in an array is faster than computing a hash".
A simple hashing rule for strings might be just a sum (with a modulo in the end). This is a branchless operation that can compare favorably with character comparisons, especially when there is a long match on the prefix.

Why is insertion sort so fast compared to other sorting algorithms?

I've been testing the speed of various other sorting algorithms (Selection, Quick, Bubble, Shell, Radix, etc) along with Insertion Sort. However, Insertion Sort seems to be the fastest algorithm by far. I always thought Quick Sort would be the fastest.
Here is my code for the Insertion Sort and Timer functions in Python 3.
def InsertionSort(argShuffledList):
for index in range(1,len(argShuffledList)):
currentvalue = argShuffledList[index]
position = index
while position>0 and argShuffledList[position-1]>currentvalue:
argShuffledList[position]=argShuffledList[position-1]
position = position-1
argShuffledList[position]=currentvalue
return argShuffledList
def Timer(argFunction, *args): ## function for timing functions
dblStart = time.clock()
argFunction(*args)
intTime = "%.2f" % ((time.clock() - dblStart) * 1000000)
message = "Elasped Time: " + str(intTime) + " microseconds"
return message
insertionSortList = InsertionSort(insertionCopyList)
timeInsertionSortList = Timer(InsertionSort, insertionCopyList)
You are sorting the list before timing it. So when you time it, the list is already sorted, and your insertion sort needs to do no insertions, so runs in linear time.
insertionSortList = InsertionSort(insertionCopyList)
timeInsertionSortList = Timer(InsertionSort, insertionCopyList)
It's not clear what the intention of the first call to InsertionSort is, unless you're trying to time already-sorted lists.
The "best" sort for a task is dependent on a number of criteria including data type, number of items, data distribution, etc.
The number of items matters because the simpler O(n^2) sort algorithms have much less overhead than the more process-efficient O(n lg n) algorithms, but the overhead becomes much less significant the more items you have to sort. This is why at least one implementation of qsort which Microsoft shipped with Visual C++ used a quicksort until a given partition had seven or less items, at which point the partition was sorted using an insertion sort.
Insertion sort is faster than some of the other O(n^2) sort algorithms because it has less overhead (especially when compared with bubble sort).
There are also variations of sorting algorithms. For example, a two-partition quicksort has a worse worst-case performance than the more complex three-partition quicksort, especially if the data was already sorted in reverse-order.
Radix sort is a special sort whose performance greatly relies on the data type. It is best used when the data items are fixed-width and this fact can be leveraged, which makes it great for sorting lists of integers but not for sorting lists of general strings. I have seen work someone did to make it efficient for sorting (32-bit) floating-point, but it did heavily leverage the internal storage of the floating-point number.
It actually depends on what type of list you are using: even though Insetrion Sort's worst-case time is O(n2) it is an approximation meaning that it does not apply for small or nearly sorted lists.
Quick sort uses extra overhead because of all the recursive functions. This is why you should privilege Insertion sort over the divide and conquer sorting algorithms like merge sort and quick sort when the problem is small.

Optimize algorithm to compute orbit under a given action in python

My goal is to iterate through a set S of elements given a single element and an action G: S -> S that acts transitively on S (i.e., for any elt,elt' in S, there is a map f in G such that f(elt) = elt'). The action is finitely generated, so I can use that I can apply each generator to a given element.
The algorithm I use is:
def orbit(act,elt):
new_elements = [elt]
seen_elements = set([elt])
yield elt
while new_elements:
elt = new_elements.pop()
seen_elements.add(elt)
for f in act.gens():
elt_new = f(elt)
if elt_new not in seen_elements:
new_elements.append(elt_new)
seen_elements.add(elt_new)
yield elt_new
This algorithm seems to be well-suited and very generic. BUT it has one major and one minor slowdown in big computations that I would like to get rid of:
The major: seen_elements collects all the elements, and is thus too memory consuming, given that I do not need the actual elements anymore.
How can I achieve to not have all the elements stored in memory?
Very likely, this depends on what the elements are. So for me, these are short lists (<10 entries) of ints (each < 10^3). So first, is there a fast way to associate a (with high probability) unique integer to such a list? Does that save much memory? If so, should I put those into a dict to check the containment (in this case, first the hash equality test, and then an int equality test are done, right?), or how should I do that?
the minor: poping the element takes a lot of time given that I don't quite need that list. Is there a better way of doing that?
Thanks a lot for your suggestions!
So first, is there a fast way to associate a (with high probability) unique integer to such a list?
If the list entries all are in range(1, 1024), then sum(x << (i * 10) for i, x in enumerate(elt)) yields a unique integer.
Does that save much memory?
The short answer is yes. The long answer is that it's complicated to determine how much. Python's long integer representation uses (probably) 30-bit digits, so the digits will pack 3 to the 32-bit word instead of 1 (or 0.5 for 64-bit). There's some object overhead (8/16 bytes?), and then there's the question of how many of the list entries require separate objects, which is where the big win may lie.
If you can tolerate errors, then a Bloom filter would be a possibility.
the minor: popping the element takes a lot of time given that I don't quite need that list. Is there a better way of doing that?
I find that claim surprising. Have you measured?

C data structures

Is there a C data structure equatable to the following python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it an pre-defined string and have it come out with an integer.
The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered because a hash function returns a uniformly random result and puts your data unpredictable all over the map (in a perfect world).
Also note that if you're doing a quick one-off hash, like a two or three static hash for some lookup: look at gperf, which generates a perfect hash function and generates simple code for that hash.
The above data structure is a dict type.
In C/C++ paralance, a hashmap should be equivalent, Google for hashmap implementation.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is to probably just have an array of structures along the lines of:
typedef struct {
char *key;
int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
create g.key[100] as string
create g.val[100] as integer
set g.size to 0
def add (key,val):
if lookup(key) != not_found:
return already_exists
if g.size == 100:
return no_space
g.key[g.size] = key
g.val[g.size] = val
g.size = g.size + 1
return okay
def del (key):
pos = lookup (key)
if pos == not_found:
return no_such_key
if pos < g.size - 1:
g.key[pos] = g.key[g.size-1]
g.val[pos] = g.val[g.size-1]
g.size = g.size - 1
def find (key):
for pos goes from 0 to g.size-1:
if g.key[pos] == key:
return pos
return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient one wasn't available. That's because it massively quick to code up and the performance is irrelevant.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc) in to month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hasmap' should do. The simplest implementation is an array of struct { char *s; int i }; pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.

Categories

Resources