Python heapq implementation

The heapq documentation says:
These two make it possible to view the heap as a regular Python list without surprises: heap[0] is the smallest item, and heap.sort() maintains the heap invariant!
So is the heapq implementation really just heap.sort() after every push/pop, or is it implemented as a traditional min-heap (which would make sense, since that would be O(log n) instead of O(n log n) for push and pop)?

Firstly, heappush() and heappop() in the heapq library are definitely O(log n).
Secondly, heap.sort() would sort the items in increasing order, which means the min-heap rule that a parent's value is always less than or equal to its children's values is still maintained.
The heapq implementation is definitely not heap.sort() after every push() and pop(), because that would be O(n log n) per operation, far worse than the O(log n) it provides. For more information take a look at https://github.com/python/cpython/blob/master/Lib/heapq.py
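As a quick illustration of standard heapq usage (stdlib only): each push and pop costs O(log n) sift operations, and heap[0] is always the minimum.

    import heapq

    heap = []
    for value in [5, 1, 4, 2, 3]:
        heapq.heappush(heap, value)  # O(log n): sift the new item up

    print(heap[0])               # 1 -- the smallest item is always at index 0
    print(heapq.heappop(heap))   # 1 -- O(log n): move last item to root, sift down
    print(heapq.heappop(heap))   # 2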

Related

Deque (time complexity)

What is the time complexity for accessing deque[0], deque[somewhere in the middle], and deque[-1]?
From the documentation:
Indexed access is O(1) at both ends but slows to O(n) in the middle. For fast random access, use lists instead.
This suggests that the implementation is a doubly linked list (in CPython it is in fact a doubly linked list of fixed-size blocks, which is why middle access degrades to O(n)).
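A rough timing sketch (numbers are machine-dependent; the point is the end-vs-middle asymmetry):

    from collections import deque
    import timeit

    d = deque(range(1_000_000))

    # O(1) at the ends, O(n) in the middle: a middle access has to
    # walk the chain of blocks from the nearer end.
    print(timeit.timeit(lambda: d[0], number=1_000))
    print(timeit.timeit(lambda: d[len(d) // 2], number=1_000))
    print(timeit.timeit(lambda: d[-1], number=1_000))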

Time complexity of operations in SortedList - Python

What is the time complexity of operations in the SortedList implementation of the sortedcontainers module?
As I understand it, the underlying data structure is an array list. So does insertion take O(n) time, since the index can be found in O(log n) but inserting the element at the correct location is O(n)?
Similarly, popping an element from an index must be O(n) as well.
Insert, remove, get index, bisect left and right, and finding an element in the list are all O(log n) operations. It's functionally similar to TreeSet in Java and multiset in C++, which are implemented with AVL or red-black trees (sortedcontainers itself uses a list of sublists rather than a tree).
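A short usage sketch (sortedcontainers is a third-party package: pip install sortedcontainers):

    from sortedcontainers import SortedList

    sl = SortedList([3, 1, 2])
    sl.add(0)                  # amortized O(log n); the list stays sorted
    print(sl)                  # SortedList([0, 1, 2, 3])
    print(sl.bisect_left(2))   # 2 -- O(log n) position lookup
    print(2 in sl)             # True -- O(log n) membership test
    sl.remove(1)               # O(log n)
    print(sl[0], sl[-1])       # 0 3 -- indexed access at either end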

What is the algorithmic complexity of converting a collections.deque to python list?

I'm trying to determine whether the complexity of converting a collections.deque object into a Python list object is O(n). I imagine it would have to take every element and copy it into the list, but I cannot seem to find the implementation code behind deque. So has Python built something more efficient under the hood that could allow for O(1) conversion to a list?
Edit: Based on the following, I do not believe it could be any faster than O(n):
"Indexed access is O(1) at both ends but slows to O(n) in the middle. For fast random access, use lists instead."
If it cannot access a middle node in O(1) time, it will not be able to convert without at least the same complexity.
You have to access every node. O(1) time is impossible for that fact alone.
I would believe that a deque follows the same principles as conventional deques, in that access to the first element is constant time. You have to do that for n elements, so the runtime would be O(n).
Here is the implementation of deque (Modules/_collectionsmodule.c in the CPython source).
However, that is irrelevant for determining the complexity of converting a deque to a list in Python.
If Python is not somehow reusing the internal data structure, conversion into a list will require a walk through the deque, and it will be O(n).
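A quick empirical check (timings are machine-dependent; the point is the roughly linear scaling):

    from collections import deque
    import timeit

    for n in (10_000, 100_000, 1_000_000):
        d = deque(range(n))
        t = timeit.timeit(lambda: list(d), number=10)
        print(f"n={n:>9,}  10 x list(d): {t:.4f}s")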

Python heapq vs. sorted complexity and performance

I'm relatively new to Python (using v3.x syntax) and would appreciate notes regarding the complexity and performance of heapq vs. sorted.
I've already implemented a heapq-based solution for a greedy 'find the best job schedule' algorithm. But then I learned about the possibility of using 'sorted' together with operator.itemgetter() and reverse=True.
Sadly, I could not find any explanation of the expected complexity and/or performance of 'sorted' vs. heapq.
If you use a binary heap to pop all elements in order, what you are doing is basically heapsort. It is slower than the sort algorithm used by the sorted function, quite apart from the fact that heapq's implementation is pure Python.
heapq is faster than sorted when you need to add elements on the fly, i.e. when additions and insertions may arrive in an unspecified order: adding a new element while preserving the heap invariant is faster than re-sorting the array after each insertion.
sorted is faster if you will need to retrieve all elements in order later.
The only case where they really compete is when you need some portion of the smallest (or largest) elements from the collection. Although there are special algorithms for that case, whether heapq or sorted will be faster depends on the size of the initial array and the portion you need to extract.
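A minimal benchmark sketch of the streaming case described above (sizes and repeat counts are arbitrary; exact timings will vary):

    import heapq
    import random
    import timeit

    data = [random.random() for _ in range(5_000)]

    def with_heap():
        # Maintain a heap as items arrive: O(log n) per insertion.
        h = []
        for x in data:
            heapq.heappush(h, x)
        return h[0]

    def with_resort():
        # Re-sort after every append: even Timsort pays an O(n) pass
        # over the nearly-sorted list, so the whole stream is O(n^2).
        lst = []
        for x in data:
            lst.append(x)
            lst.sort()
        return lst[0]

    print("heap:  ", timeit.timeit(with_heap, number=10))
    print("resort:", timeit.timeit(with_resort, number=10))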
The nlargest() and nsmallest() functions of heapq are most appropriate if you are trying to find a relatively small number of items. If you simply want the single smallest or largest item, min() and max() are more suitable, and if N is close to the size of the collection itself, it is usually faster to sort first and take a slice. When N is small compared to the overall size of the collection, though, these functions provide superior performance. It's not strictly necessary to use heapq in your code, but it's an interesting topic and a worthwhile subject of study.
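For illustration, standard usage of those helpers (the portfolio data is made up):

    import heapq

    nums = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
    print(heapq.nlargest(3, nums))   # [42, 37, 23]
    print(heapq.nsmallest(3, nums))  # [-4, 1, 2]

    # Both accept a key function, e.g. the two cheapest stocks:
    portfolio = [
        {'name': 'IBM',  'price': 91.10},
        {'name': 'AAPL', 'price': 543.22},
        {'name': 'FB',   'price': 21.09},
    ]
    print(heapq.nsmallest(2, portfolio, key=lambda s: s['price']))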
heapq is implemented as a binary heap.
The key things to note about binary heaps, and by extension heapq:
Searching is not supported
Insertions are constant time on average (O(log n) worst case)
Deletions are O(log n) time
Additional binary heap info described here: http://en.wikipedia.org/wiki/Binary_heap
While heapq is a data structure with the properties of a binary heap, using sorted is a different concept: sorted returns a sorted list, so it's essentially a result, whereas heapq is a data structure you continually work with, which could, optionally, be sorted via sorted.
Additional sorted info here: https://docs.python.org/3.4/library/functions.html#sorted
What specifically are you trying to accomplish?
Response to OP's comment:
Why do you think you need a heapq specifically? A binary heap is a specialized data structure, and depending on your requirements, it's quite likely not necessary.
You seem to be extremely concerned about performance, but it's not clear why. If something is a "bad performer" but its aggregate time is not significant, then it really doesn't matter in the bigger picture. In the aggregate case, a dict or a list would generally perform fine. Why do you specifically think a heapq is needed?
I wonder if this is a don't-let-the-perfect-be-the-enemy-of-the-good type of situation.
Writing Python using C extensions is a niche use case reserved for cases where performance is truly a significant issue. (i.e. it may be better to use, say, an XML parser that is a C extension than something that is pure Python if you're dealing with large files and if performance is your main concern).
Regarding the comment "in complex 'keep playing with the structure' cases, could it be faster to sort with sorted and add elements via .append()":
I'm still not clear what the use case is here. As I mentioned above, sorted and heapq are really two different concepts.
What is the use case for which you are so concerned about performance? (Absent other factors not yet specified, I think you may be overly emphasizing the importance of best-case performance in your code here.)

Best data structure to use for a two-ended sorted list

I need a collection data-structure that can do the following:
Be sorted
Allow me to quickly pop values off the front and back of the list in O(log n)
Remain sorted after I insert a new value
Allow a user-specified comparison function, as I will be storing tuples and want to sort on a particular value
Thread-safety is not required
Optionally allow efficient haskey() lookups (I'm happy to maintain a separate hash-table for this though)
My thoughts at this stage are that I need a priority queue and a hash table, although I don't know if I can quickly pop values off both ends of a priority queue. Another possibility is simply maintaining an OrderedDictionary and doing an insertion sort every time I add more data to it.
Because I'm interested in performance for a moderate number of items (I would estimate fewer than 200,000), I am unsure about what asymptotic performance I require for these operations. n will not grow infinitely, so a low constant factor k in k * O(n) may be as important as the O(n) itself. That said, I would prefer that both the insert and pop operations take O(log n) time.
Furthermore, are there any particular implementations in Python? I would really like to avoid writing this code myself.
You might get good performance for these kinds of operations using blist or a database (such as the sqlite3 module in the stdlib).
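As an aside not in the original answers: the sortedcontainers module discussed earlier also covers these requirements. A minimal sketch:

    from sortedcontainers import SortedKeyList  # pip install sortedcontainers

    # Sort tuples on their second field, per the custom-comparison requirement.
    skl = SortedKeyList(key=lambda t: t[1])
    for item in [('a', 5), ('b', 1), ('c', 3)]:
        skl.add(item)        # stays sorted after each insert

    print(skl.pop(0))        # ('b', 1) -- pop the front (smallest key)
    print(skl.pop(-1))       # ('a', 5) -- pop the back (largest key)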
I suggest some sort of balanced binary tree such as a red-black tree.
A search on PyPI throws up a couple of implementations, and searching on Google will give you more.
bintrees on PyPI looks very complete and has both Python and C/Cython implementations. I have not used it, though, so caveat emptor.
A red-black tree is kept sorted and most operations (insert, delete, find) are O(log2 N), so finding an element in a tree of 200,000 entries will take on average 17-18 comparisons.
Sounds like a skip list will fulfill all your requirements. It's basically a dynamically-sized sorted linked list, with O(log n) insertions and removals.
I don't really know Python, but this link seems to be relevant:
http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/
I presume you need it sorted because you access elements by rank in the sorted order?
You can use any implementation of any balanced binary tree, with the additional information at each node telling you the number of descendants of that node (usually called an order statistic tree).
With this structure, given the rank of an element (even min/max), you can access/delete it in O(log n) time.
This makes all operations (access/insert/delete by rank, pop front/back, insert/delete/search by value) O(log n) time, while allowing custom sort methods.
Also, apparently python has an AVL tree (one of the first balanced tree structures) implementation which supports order statistics: http://www.python.org/ftp/python/contrib-09-Dec-1999/DataStructures/avl.README
So you won't need a custom implementation.
Except for the hashing, what you're looking for is a double-ended priority queue, aka a priority deque.
If your need for sorting doesn't extend beyond managing the min and max of your data, another structure for you to look at might be an interval heap, which has the advantage of O(1) lookup of both min and max if you need to peek at values (though deleteMin and deleteMax are still O(log(N)) ). Unfortunately, I'm not aware of any implementations in Python, so I think you'd have to roll your own.
Here's an addendum to an algorithms textbook that describes interval heaps if you're interested:
http://www.mhhe.com/engcs/compsci/sahni/enrich/c9/interval.pdf
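There is no double-ended priority queue in the stdlib, but one common workaround is two heapq heaps with lazy deletion. A rough sketch (the class name and details here are my own, not from any library):

    import heapq
    import itertools

    class PriorityDeque:
        # Double-ended priority queue built from two heaps with lazy deletion.

        def __init__(self):
            self._min = []          # min-heap of (key, id, item)
            self._max = []          # max-heap of (-key, id, item)
            self._dead = set()      # ids already popped from the other side
            self._counter = itertools.count()

        def push(self, key, item):
            i = next(self._counter)
            heapq.heappush(self._min, (key, i, item))
            heapq.heappush(self._max, (-key, i, item))

        def pop_min(self):
            while True:
                key, i, item = heapq.heappop(self._min)
                if i not in self._dead:   # skip entries the max side removed
                    self._dead.add(i)
                    return item

        def pop_max(self):
            while True:
                _, i, item = heapq.heappop(self._max)
                if i not in self._dead:
                    self._dead.add(i)
                    return item

    pd = PriorityDeque()
    for name, priority in [('a', 5), ('b', 1), ('c', 3)]:
        pd.push(priority, name)
    print(pd.pop_min())  # b
    print(pd.pop_max())  # a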
If you can really allow O(log n) for pop, dequeue, and insert, then a simple balanced search tree like a red-black tree is definitely sufficient.
You can optimize this, of course, by maintaining direct pointers to the smallest and largest elements in the tree, and then updating them when you (1) insert elements into the tree or (2) pop or dequeue, which of course invalidates the respective pointer. But because the tree is balanced, there's some shuffling going on anyway, and you can update the corresponding pointer at the same time.
There is also something called a min-max heap (see the Wikipedia entry for binary heap), which implements exactly a "double-ended priority queue", i.e. a queue where you can pop from both the front and the rear. However, you can't access the whole list of objects in order there, whereas a search tree can be iterated through efficiently in O(n) time.
The benefit of a min-max heap, however, is that the current min and max objects can be read in O(1) time; a search tree requires O(log n) just to read the min or max object unless you have the cached pointers mentioned above.
If this were Java I'd use a TreeSet with the NavigableSet interface.
This is implemented as a red-black tree.
