I am looking for a Python data structure that functions as a sorted list and has the following asymptotics:
O(1) pop from beginning (pop smallest element)
O(1) pop from end (pop largest element)
>= O(log n) insert
Does such a data structure with an efficient implementation exist? If so, is there a library that implements it in Python?
A regular red/black tree or B-tree can do this in an amortized sense. If you store pointers to the smallest and largest elements of the tree, then the cost of deleting those elements is amortized O(1), meaning that any series of d deletions takes O(d) time, though an individual deletion may take longer. Insertions cost O(log n), which is as good as possible: with a faster insert you could sort n items in less than O(n log n) time by inserting them all and then popping the smallest element repeatedly.
As for libraries that implement this - that I’m not sure of.
I am having problems trying to find the Big-O runtime of this. It's building a heap by calling the insert function to insert the elements into the heap.
buildHeap(A)
    h = new empty heap
    for each element e in A
        h.insert(e)
What is the Big-O runtime of this version of buildHeap?
Written this way, for a typical binary heap, it would be O(n log n); you're inserting one at a time, and each insertion is O(log n). There are optimized ways to build a heap from an array of n elements all at once in O(n) time (referred to as the "heapify" operation), but they don't work by repeated single-element insertions.
The big-O could change depending on the type of heap; some variant heap designs have O(1) insertion, though of course they come with other trade-offs that differ by type, e.g. memory fragmentation, complexity of implementation, higher fixed costs per operation, etc.
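To make the difference concrete, here is a minimal sketch using the standard library's heapq module. The function names build_heap_by_insertion, build_heap_by_heapify and is_min_heap are made up for this illustration; only heapq.heappush and heapq.heapify come from the library.

```python
import heapq
import random


def build_heap_by_insertion(items):
    """Build a min-heap with n single pushes: O(n log n) in the worst case."""
    heap = []
    for item in items:
        heapq.heappush(heap, item)  # each push is O(log n)
    return heap


def build_heap_by_heapify(items):
    """Build a min-heap bottom-up in O(n) using heapq.heapify."""
    heap = list(items)  # copy so the caller's list is left untouched
    heapq.heapify(heap)
    return heap


def is_min_heap(heap):
    """Check the heap invariant: every parent is <= its children."""
    return all(heap[(i - 1) // 2] <= heap[i] for i in range(1, len(heap)))


data = [random.randrange(1000) for _ in range(200)]
by_insertion = build_heap_by_insertion(data)
by_heapify = build_heap_by_heapify(data)
```

Both results are valid heaps over the same elements, although the two lists will usually not be laid out identically.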
There's already a question regarding this, and the answer says that the asymptotic complexity is O(n). But I observed that if an unsorted list is converted into a set, the set can be printed out in a sorted order, which means that at some point in the middle of these operations the list has been sorted. Then, as any comparison sort has the lower bound of Omega(n lg n), the asymptotic complexity of this operation should also be Omega(n lg n). So what exactly is the complexity of this operation?
A set in Python is an unordered collection so any order you see is by chance. As both dict and set are implemented as hash tables in CPython, insertion is average case O(1) and worst case O(N).
So list(set(...)) is always O(N) and set(list(...)) is average case O(N).
You can browse the source code for set here.
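As a small illustration of the point above (the sample values are arbitrary): converting to a set only removes duplicates, and if you actually want sorted output you have to ask for it, which is where the O(N log N) cost comes from. Small integers often come out looking sorted in CPython because they hash to themselves, but that is an implementation detail and not something to rely on.

```python
data = [5, 3, 8, 3, 1, 8, 2]

# Average-case O(N): hashing removes duplicates, element order is arbitrary.
unique = list(set(data))

# The explicit sort is the O(N log N) step.
unique_sorted = sorted(unique)

print(unique)         # order depends on the hash table layout
print(unique_sorted)
```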
What space complexity does the python sort take? I can't find any definitive documentation on this anywhere
Space complexity is defined as how much additional space the algorithm needs in terms of the N elements. And even though according to the docs, the sort method sorts a list in place, it does use some additional space, as stated in the description of the implementation:
timsort can require a temp array containing as many as N//2 pointers, which means as many as 2*N extra bytes on 32-bit boxes. It can be expected to require a temp array this large when sorting random data; on data with significant structure, it may get away without using any extra heap memory.
Therefore the worst case space complexity is O(N) and the best case is O(1).
Python's built-in sort method uses Timsort, a hybrid derived from merge sort and insertion sort (more information at https://en.wikipedia.org/wiki/Timsort).
Asymptotically it is no worse than merge sort: its run time is O(n log n) on average and in the worst case, and its extra space is O(n) in the worst case.
I need a collection data-structure that can do the following:
Be sorted
Allow me to quickly pop values off the front and back of the list, in O(log n)
Remain sorted after I insert a new value
Allow a user-specified comparison function, as I will be storing tuples and want to sort on a particular value
Thread-safety is not required
Optionally allow efficient haskey() lookups (I'm happy to maintain a separate hash-table for this though)
My thoughts at this stage are that I need a priority queue and a hash table, although I don't know if I can quickly pop values off both ends of a priority queue. Another possibility is simply maintaining an OrderedDictionary and performing an insertion sort on it every time I add more data to it.
Because I'm interested in performance for a moderate number of items (I would estimate less than 200,000), I am unsure about what asymptotic performance I require for these operations. n will not grow infinitely, so a low constant factor k in k * O(n) may be as important as the O(n) itself. That said, I would prefer that both the insert and pop operations take O(log n) time.
Furthermore, are there any particular implementations in Python? I would really like to avoid writing this code myself.
You might get good performance for these kinds of operations using blist or a database (such as SQLite, which is available in the stdlib as the sqlite3 module).
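For the database route, a minimal sketch with the stdlib sqlite3 module might look like the following. The table name items, the columns sort_key and payload, and the helper functions insert, pop_min and pop_max are all invented for this example; it also assumes the sort key is numeric. The index on sort_key is what keeps finding the smallest or largest row cheap.

```python
import sqlite3

# In-memory database; schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (sort_key REAL NOT NULL, payload TEXT)")
conn.execute("CREATE INDEX idx_items_sort_key ON items (sort_key)")


def insert(sort_key, payload):
    """Add one row; the B-tree index keeps rows ordered by sort_key."""
    conn.execute(
        "INSERT INTO items (sort_key, payload) VALUES (?, ?)",
        (sort_key, payload),
    )


def _pop(order):
    # order is the literal "ASC" or "DESC" chosen below, never user input.
    row = conn.execute(
        "SELECT rowid, sort_key, payload FROM items "
        "ORDER BY sort_key " + order + " LIMIT 1"
    ).fetchone()
    if row is None:
        raise IndexError("pop from empty collection")
    conn.execute("DELETE FROM items WHERE rowid = ?", (row[0],))
    return row[1], row[2]


def pop_min():
    """Remove and return the (sort_key, payload) with the smallest key."""
    return _pop("ASC")


def pop_max():
    """Remove and return the (sort_key, payload) with the largest key."""
    return _pop("DESC")


for key, name in [(3, "c"), (1, "a"), (2, "b")]:
    insert(key, name)

smallest = pop_min()
largest = pop_max()
```

Whether this beats a pure-Python structure for around 200,000 items is something you would have to measure; the per-statement overhead of SQL is not negligible.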
I suggest some sort of balanced binary tree such as a red-black tree.
A search on PyPI throws up a couple of implementations. Searching on Google will give you more.
bintrees on PyPI looks very complete and has both Python and C/Cython implementations. I have not used it though, so caveat emptor.
A red-black tree is kept sorted and most operations (insert, delete, find) are O(log2(N)), so finding an element in a tree of 200,000 entries will take on average 17-18 comparisons.
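The 17-18 figure is just log2 of the collection size. A quick check, which also prints the standard worst-case height bound for a red-black tree, 2 * log2(n + 1) (the variable names are only for illustration):

```python
import math

n = 200_000

# A perfectly balanced binary tree over n keys has about log2(n) levels.
balanced_depth = math.log2(n)

# A red-black tree with n keys has height at most 2 * log2(n + 1).
red_black_height_bound = 2 * math.log2(n + 1)

print(f"log2({n}) = {balanced_depth:.2f}")
print(f"red-black height bound = {red_black_height_bound:.2f}")
```

So a lookup needs somewhere between 17 and 18 comparisons in a well-balanced tree, and no more than about twice that in the worst case.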
Sounds like a skip list will fulfill all your requirements. It's basically a dynamically-sized sorted linked list, with expected O(log n) insertions and removals.
I don't really know Python, but this link seems to be relevant:
http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/
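For readers who want to see the idea without following the link, here is a minimal skip list sketch in Python. It is not taken from the linked page; the class names, the MAX_LEVEL cap of 16 and the coin-flip probability of 0.5 are assumptions chosen for the example. Only insert, pop_first and in-order iteration are shown; popping from the back and a custom comparison are left out to keep it short.

```python
import random


class _Node:
    __slots__ = ("value", "forward")

    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * level  # forward[i] is the next node on level i


class SkipList:
    MAX_LEVEL = 16  # assumed cap, plenty for a few hundred thousand items

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._head = _Node(None, self.MAX_LEVEL)  # sentinel, holds no value
        self._level = 1  # number of levels currently in use
        self._size = 0

    def __len__(self):
        return self._size

    def __iter__(self):
        node = self._head.forward[0]
        while node is not None:
            yield node.value
            node = node.forward[0]

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.5:
            level += 1
        return level

    def insert(self, value):
        """Insert value in sorted position; expected O(log n)."""
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in reversed(range(self._level)):
            while node.forward[i] is not None and node.forward[i].value < value:
                node = node.forward[i]
            update[i] = node  # last node on level i that precedes value
        level = self._random_level()
        self._level = max(self._level, level)
        new_node = _Node(value, level)
        for i in range(level):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node
        self._size += 1

    def pop_first(self):
        """Remove and return the smallest value."""
        first = self._head.forward[0]
        if first is None:
            raise IndexError("pop from empty skip list")
        # The smallest node is the first node on every level it appears in.
        for i in range(len(first.forward)):
            self._head.forward[i] = first.forward[i]
        self._size -= 1
        return first.value


values = [7, 3, 9, 1, 5, 3]
sl = SkipList(seed=42)
for v in values:
    sl.insert(v)

in_order = list(sl)
smallest = sl.pop_first()
```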
I presume you need it sorted because you access element by rank in the sorted order?
You can use any implementation of any balanced binary tree, with the additional information at each node which tells you the numbers of descendants of that node (usually called the Order Statistic Binary Tree).
With this structure, given the rank of an element (even min/max), you can access/delete it in O(log n) time.
This makes all operations (access/insert/delete by rank, pop front/back, insert/delete/search by value) O(log n) time, while allowing custom sort methods.
Also, apparently python has an AVL tree (one of the first balanced tree structures) implementation which supports order statistics: http://www.python.org/ftp/python/contrib-09-Dec-1999/DataStructures/avl.README
So you won't need a custom implementation.
Except for the hashing, what you're looking for is a double-ended priority queue, aka a priority deque.
If your need for sorting doesn't extend beyond managing the min and max of your data, another structure for you to look at might be an interval heap, which has the advantage of O(1) lookup of both min and max if you need to peek at values (though deleteMin and deleteMax are still O(log(N)) ). Unfortunately, I'm not aware of any implementations in Python, so I think you'd have to roll your own.
Here's an addendum to an algorithms textbook that describes interval heaps if you're interested:
http://www.mhhe.com/engcs/compsci/sahni/enrich/c9/interval.pdf
If you can really allow O(log n) for pop, dequeue, and insert, then a simple balanced search tree like red-black tree is definitely sufficient.
You can optimize this, of course, by maintaining direct pointers to the smallest and largest elements in the tree and updating them when you (1) insert elements into the tree or (2) pop or dequeue, which of course invalidates the respective pointer. But because the tree is balanced, there's some shuffling going on anyway, and you can update the corresponding pointer at the same time.
There is also something called a min-max heap (see the Wikipedia entry for Binary Heap), which implements exactly a "double-ended priority queue", i.e. a queue where you can pop from both the front end and the rear end. However, there you can't access the whole list of objects in order, whereas a search tree can be iterated through in order in O(n) time.
The benefit of a min-max heap, however, is that the current min and max objects can be read in O(1) time, whereas a search tree requires O(log(n)) just to read the min or max object unless you keep the cached pointers mentioned above.
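If you just want a double-ended priority queue from the standard library, one simple stand-in is two heapq heaps over the same entries with lazy deletion. To be clear, this is not a min-max heap or an interval heap, only a rough sketch of the same interface. The class name DoubleEndedPQ and its methods are invented here, and it assumes the key function returns numbers (so the max side can be obtained by negating the key). Pushes are O(log n) and pops are amortized O(log n); stale entries sit in the other heap until they reach its top.

```python
import heapq
import itertools


class DoubleEndedPQ:
    """Double-ended priority queue: two heaps plus lazy deletion (sketch)."""

    def __init__(self, key=lambda item: item):
        self._key = key               # must return a number in this sketch
        self._min_heap = []           # entries: (key, entry_id, item)
        self._max_heap = []           # entries: (-key, entry_id, item)
        self._alive = set()           # ids of entries not yet popped
        self._ids = itertools.count() # unique ids; also break ties between keys

    def __len__(self):
        return len(self._alive)

    def push(self, item):
        k = self._key(item)
        entry_id = next(self._ids)
        self._alive.add(entry_id)
        heapq.heappush(self._min_heap, (k, entry_id, item))
        heapq.heappush(self._max_heap, (-k, entry_id, item))

    def _pop_from(self, heap):
        # Skip entries that were already removed through the other heap.
        while heap:
            _, entry_id, item = heapq.heappop(heap)
            if entry_id in self._alive:
                self._alive.remove(entry_id)
                return item
        raise IndexError("pop from empty queue")

    def pop_min(self):
        return self._pop_from(self._min_heap)

    def pop_max(self):
        return self._pop_from(self._max_heap)


# Tuples ordered by their second field, as in the question.
pq = DoubleEndedPQ(key=lambda pair: pair[1])
for pair in [("a", 5), ("b", 1), ("c", 9), ("d", 3)]:
    pq.push(pair)

lowest = pq.pop_min()
highest = pq.pop_max()
```

The price compared with a real min-max heap is roughly double the memory, and like any heap it gives you no in-order iteration over the remaining items.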
If this were Java I'd use a TreeSet with the NavigableSet interface.
This is implemented as a Red-Black-Tree.