I'm relatively new to python (using v3.x syntax) and would appreciate notes regarding complexity and performance of heapq vs. sorted.
I've already implemented a heapq based solution for a greedy 'find the best job schedule' algorithm. But then I've learned about the possibility of using 'sorted' together with operator.itemgetter() and reverse=True.
Sadly, I could not find any explanation on expected complexity and/or performance of 'sorted' vs. heapq.
If you use a binary heap to pop all elements in order, what you are doing is essentially heapsort. It is slower than the sort algorithm behind sorted; in CPython both heapq and list sorting are backed by C code, so the difference is algorithmic rather than an artifact of a pure-Python implementation.
heapq is faster than sorted when you need to add elements on the fly, i.e. when additions and removals arrive interleaved in an unspecified order: adding a new element while preserving the heap invariant is O(log n), whereas re-sorting the list after every insertion is O(n log n).
sorted is faster if you only need to retrieve all elements in order once, later on.
The only problem where the two really compete is when you need some portion of the smallest (or largest) elements of a collection. There are specialized algorithms for that case, but whether heapq or sorted wins depends on the size of the initial array and the size of the portion you need to extract.
The nlargest() and nsmallest() functions of heapq are most appropriate when you are trying to find a relatively small number of items. If you simply want the single smallest or largest value, min() and max() are faster. Likewise, if N is close to the size of the collection itself, it is usually quicker to sort once and take a slice. nlargest() and nsmallest() provide superior performance precisely when N is small compared to the overall size of the collection. You don't strictly need heapq in your code, but it's an interesting topic and a worthwhile subject of study.
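As a quick sketch of that guidance (the data here is just an illustrative example):

import heapq

prices = [412.5, 37.2, 119.9, 5.75, 210.0, 88.4]

# Single extreme: min()/max() are the cheapest, a plain O(n) scan.
cheapest = min(prices)

# Small N relative to the collection: heapq does it in roughly O(n log N).
three_cheapest = heapq.nsmallest(3, prices)
three_priciest = heapq.nlargest(3, prices)

# N close to len(prices): sorting once and slicing is usually faster.
all_but_largest = sorted(prices)[:-1]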
heapq is implemented as a binary heap.
The key things to note about binary heaps, and by extension, heapq:
Searching is not supported
Insertions are constant time on average
Deletions are O(log n) time on average
Additional binary heap info described here: http://en.wikipedia.org/wiki/Binary_heap
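For reference, here is a minimal sketch of working with heapq directly (the module's functions operate on a plain list):

import heapq

jobs = [(3, 'write report'), (1, 'fix bug'), (2, 'review PR')]

heap = []
for priority, name in jobs:
    heapq.heappush(heap, (priority, name))    # O(log n) per push, heap order maintained

while heap:
    priority, name = heapq.heappop(heap)      # O(log n) per pop, smallest priority first
    print(priority, name)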
While heapq gives you a data structure with the properties of a binary heap, sorted is a different concept: sorted returns a sorted list, which is essentially a one-off result, whereas a heap is a data structure you keep working with, and which you could, optionally, sort later via sorted.
Additional sorted info here: https://docs.python.org/3.4/library/functions.html#sorted
What specifically are you trying to accomplish?
Response to OP's comment:
Why do you think you need a heapq specifically? A binary heap is a specialized data structure, and depending on your requirements, it's quite likely not necessary.
You seem to be extremely concerned about performance, but it's not clear why. If something is a "bad performer" but its aggregate time is not significant, then it really doesn't matter in the bigger picture. In the aggregate case, a dict or a list would generally perform fine. Why do you specifically think a heapq is needed?
I wonder if this is a don't-let-the-perfect-be-the-enemy-of-the-good type of situation.
Writing Python using C extensions is a niche use case reserved for cases where performance is truly a significant issue. (i.e. it may be better to use, say, an XML parser that is a C extension than something that is pure Python if you're dealing with large files and if performance is your main concern).
Regarding the comment "in the complex 'keep playing with the structure' case, could it be faster to sort with sorted and add elements via .append()":
I'm still not clear what the use case is here. As I mentioned above, sorted and heapq are really two different concepts.
What is the use case for which you are so concerned about performance? (Absent other factors not yet specified, I think you may be overly emphasizing the importance of best-case performance in your code here.)
Related
I'm trying to take a list (orig_list below), and return a list (new_list below) which:
does not contain duplicate items (i.e. contains only unique elements)
is sorted in reverse order
Here is what I have so far, which seems... I'm going to say "weird," though I'm sure there is a better way to say that. I'm mostly put off by using list() twice for what seems pretty straightforward, and then I'm wondering about the efficiency of this approach.
new_list = list(reversed(sorted(list(set(orig_list)))))
Question #1 (SO-style question):
Are the following propositions correct?
There is no more efficient way to get unique elements of a list than converting the list to a set and back.
Since sets are unordered in Python, one must (1) remove duplicates (by converting to a set) before sorting, because converting to a set afterwards would throw away the ordering anyway, and (2) convert back to a list before sorting.
Using list(reversed()) is programmatically equivalent to using list.sort(reverse=True).
Question #2 (bonus):
Are there any ways to achieve the same result with better big-O complexity, or using a less verbose approach? If so, what is an example / what are some examples?
sorted(set(orig_list), reverse=True)
Shortest in code, more efficient, same result.
Depending on the size, it may or may not be faster to sort first then dedupe in linear time as user2864740 suggests in comments. (The biggest drawback to that approach is it would be entirely in Python, while the above line executes mostly in native code.)
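For illustration, a rough sketch of the sort-then-dedupe-adjacent approach that comment describes (pure Python; the names are just examples):

ordered = sorted(orig_list, reverse=True)
# keep each element only when it differs from its predecessor
new_list = [x for i, x in enumerate(ordered) if i == 0 or x != ordered[i - 1]]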
Your questions:
You do not need to convert from set to list and back. sorted accepts any iterable, so set qualifies, and spits out a list, so no post-conversion needed.
reversed(sorted(x)) is not equivalent to sorted(x, reverse=True). You get the same result, but more slowly: the sort costs the same whether forward or reverse, so reversed just adds an extra pass that isn't needed if you sort into the proper order from the start.
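A quick check of the equivalence of results (not of cost):

data = [3, 1, 2]
assert list(reversed(sorted(data))) == sorted(data, reverse=True)   # same output; the left side does one extra pass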
You've got a few mildly wasteful steps in here, but your proposition is largely correct. The only real improvements to be made are to get rid of all the unnecessary temporary lists:
new_list = sorted(set(orig_list), reverse=True)
sorted already converts its input to a list (so no need to listify before passing to sorted), and you can have it directly produce the output list sorted in reverse (so no need to produce a list only to make a copy of it in reverse).
The only conceivable improvement on big-O time is if you know the data is already sorted, in which case you can avoid O(n log n) sorting, and uniqify without losing the existing sorted order by using itertools.groupby:
new_list = [key for key, grp in itertools.groupby(orig_list)]
If orig_list is sorted in forward order, you can make the result of this reversed at essentially no cost by changing itertools.groupby(orig_list) to itertools.groupby(reversed(orig_list)).
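Spelled out, that reverse-order variant is:

new_list = [key for key, grp in itertools.groupby(reversed(orig_list))]   # unique elements, in reverse order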
The groupby solution isn't really practical for initially unsorted inputs, because if duplicates are even remotely common, removing them via uniquification as a O(n) step is almost always worth it, as it reduces the n in the more costly O(n log n) sorting step. groupby is also a relatively slow tool; the nature of the implementation using a bunch of temporary iterators for each group, internal caching of values, etc., means that it's a slower O(n) in practice than the O(n) uniquification via set, with its primary advantage being the streaming aspect (making it scale to data sets streamed from disk or the network and back without storing anything for the long term, where set must pull everything into memory).
The other reason to use sorted+groupby would be if your data wasn't hashable, but was comparable; in that case, set isn't an option, so the only choice is sorting and grouping.
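For example, with lists (unhashable, but comparable) as elements, a sketch might look like:

from itertools import groupby

rows = [[2, 'b'], [1, 'a'], [2, 'b'], [1, 'a']]
unique_desc = [key for key, grp in groupby(sorted(rows, reverse=True))]
# unique_desc is [[2, 'b'], [1, 'a']]: sorting brings duplicates together, groupby collapses them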
What space complexity does the python sort take? I can't find any definitive documentation on this anywhere
Space complexity is defined as how much additional space the algorithm needs in terms of the N elements. And even though according to the docs, the sort method sorts a list in place, it does use some additional space, as stated in the description of the implementation:
timsort can require a temp array containing as many as N//2 pointers, which means as many as 2*N extra bytes on 32-bit boxes. It can be expected to require a temp array this large when sorting random data; on data with significant structure, it may get away without using any extra heap memory.
Therefore the worst-case space complexity is O(N) and the best case is O(1).
Python's built-in sort method is a spin-off of merge sort called Timsort; more information here - https://en.wikipedia.org/wiki/Timsort.
Asymptotically it matches merge sort: its average and worst-case running time is O(n log n), and its auxiliary space is O(n) in the worst case (as noted above, it can use far less on data with structure).
I am not new to programming, but I do not use Python much and my knowledge is rather broad and not very deep in the language; perhaps someone here more knowledgeable can answer my question. I find myself in the situation where I need to add items to a list and keep it sorted as items are added. A quick way of doing this would be:
some_list.append(item)  # O(1)
some_list.sort()        # ??
I would imagine if this is the only way items are added to the list, I would hope the sort would be rather efficient, because the list is sorted with each addition. However there is also this that works:
inserted = False
for i in range(len(some_list)):       # O(n)
    if item < some_list[i]:
        some_list.insert(i, item)     # ??
        inserted = True
        break
if not inserted:
    some_list.append(item)
Can anyone tell me if one of these is obviously more efficient? I am leaning toward the second set of statements, however I really have no idea.
What you are looking for is the bisect module, most probably insort_left.
So your expression could be equivalently written as
from
some_list.append(item)  # O(1)
some_list.sort()        # ??
to
import bisect
bisect.insort_left(some_list, item)   # O(log n) to find the spot, O(n) to shift elements
Insertion anywhere except near the end takes O(n) time, as it has to move (copy) all elements that come after the insertion point. On the other hand, every comparison-based sorting algorithm must, on average, make Ω(n log n) comparisons. Many sorts (including timsort, which Python uses) will do significantly better on many inputs, likely including yours (the "almost sorted" case). They still have to move at least as many elements as inserting at the right position straight away would, though, and they also have to do quite a bit of additional work (inspecting all elements to ensure they're in the right order, plus more complicated logic that often improves performance, but not in your case). For these reasons, append-then-sort is probably slower, at least for large lists.
Due to being written in C (in CPython; but similar reasoning applies for other Pythons), it may still be faster than your Python-written linear scan. That leaves the question of how to find the insertion point. Binary search can do this part in O(log n) time, so it's quite useful here (of course, insertion is still O(n), but there's no way around this if you want a sorted list). Unfortunately, binary search is rather tricky to implement. Fortunately, it's already implemented in the standard library: bisect.
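A small sketch of that division of labour, using only the standard library:

import bisect

some_list = [1, 3, 4, 7, 9]
item = 5

i = bisect.bisect_left(some_list, item)   # O(log n): binary search for the insertion point
some_list.insert(i, item)                 # O(n): shift the tail to make room
# some_list is now [1, 3, 4, 5, 7, 9]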
The python wiki says: "Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple."
I've been using sets in place of lists whenever speed is important in my code, but lately I've been wondering why sets are so much faster than lists. Could anyone explain, or point me to a source that would explain, what exactly is going on behind the scenes in python to make sets faster?
list: Imagine you are looking for your socks in your closet, but you don't know in which drawer your socks are, so you have to search drawer by drawer until you find them (or maybe you never do). That's what we call O(n), because in the worst scenario, you will look in all your drawers (where n is the number of drawers).
set: Now, imagine you're still looking for your socks in your closet, but now you know in which drawer your socks are, say in the 3rd drawer. So, you will just search in the 3rd drawer, instead of searching in all drawers. That's what we call O(1), because in the worst scenario you will look in just one drawer.
Sets are implemented using hash tables. Whenever you add an object to a set, its position within the set's underlying table is determined from the hash of the object. When testing for membership, all that needs to be done is basically to look whether the object is at the position determined by its hash, so the speed of this operation does not depend on the size of the set. For lists, in contrast, the whole list needs to be searched, which becomes slower as the list grows.
This is also the reason that sets do not preserve the order of the objects you add.
Note that sets aren't faster than lists in general -- membership test is faster for sets, and so is removing an element. As long as you don't need these operations, lists are often faster.
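A rough way to see the difference yourself (exact numbers will vary by machine):

import timeit

setup = "data_list = list(range(100000)); data_set = set(data_list)"
print(timeit.timeit("99999 in data_list", setup=setup, number=1000))   # O(n) scan on every test
print(timeit.timeit("99999 in data_set", setup=setup, number=1000))    # O(1) hash lookup on every test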
I think you need to take a good look at a book on data structures. Basically, Python lists are implemented as dynamic arrays and sets are implemented as hash tables.
The implementation of these data structures gives them radically different characteristics. For instance, a hash table has a very fast lookup time but cannot preserve the order of insertion.
While I have not measured anything performance related in python so far, I'd still like to point out that lists are often faster.
Yes, you have O(1) vs. O(n). But always remember that this only describes asymptotic behaviour: for sufficiently large n the O(1) operation will be faster, theoretically. In practice, however, n often needs to be much bigger than your usual data set will be before that holds.
So sets are not faster than lists per se, but only if you have to handle a lot of elements.
Python sets use hash tables, which have O(1) average-case lookup.
Basically, it depends on the operation you are doing…
For adding an element: a set doesn't need to move any data; all it has to do is compute a hash value and add an entry to the table. For a list insertion, there is potentially data to be moved.
For deleting an element: all a set needs to do is remove the entry from the hash table; a list potentially needs to move data around (on average half of the data).
For a search (i.e. the in operator): a set just computes the hash value of the item, looks that hash up in the table, and if it is there - bingo. A list has to examine each item in turn, on average half of the items in the list. Even for many thousands of items, a set will be far quicker to search.
Actually, sets are not faster than lists in every scenario; for many common operations (iterating, indexing, appending) lists are faster. But when searching for an element in a collection, sets win because they are implemented with hash tables: Python does not have to scan the whole set, so the average time complexity is O(1). Lists use dynamic arrays, and Python needs to walk the whole array to search, which takes O(n).
So in the end, sets are better in some cases and lists are better in others. It's up to us to select the appropriate data structure for the task.
A list must be searched one item at a time, whereas a set or dictionary uses a hash table as an index for faster lookups.
I need a collection data-structure that can do the following:
Be sorted
Allow me to quickly pop values off the front and back of the list O(log n)
Remain sorted after I insert a new value
Allow a user-specified comparison function, as I will be storing tuples and want to sort on a particular value
Thread-safety is not required
Optionally allow efficient haskey() lookups (I'm happy to maintain a separate hash-table for this though)
My thoughts at this stage are that I need a priority queue and a hash table, although I don't know if I can quickly pop values off both ends of a priority queue. Another possibility is simply maintaining an OrderedDictionary and doing an insertion sort every time I add more data to it.
Because I'm interested in performance for a moderate number of items (I would estimate less than 200,000), I am unsure about what asymptotic performance I require for these operations. n will not grow infinitely, so a low constant factor k in k * O(n) may matter as much as the asymptotic class itself. That said, I would prefer that both the insert and pop operations take O(log n) time.
Furthermore, are there any particular implementations in Python? I would really like to avoid writing this code myself.
You might get good performance for these kinds of operations using blist or a database (such as sqlite3, which is in the stdlib).
I suggest some sort of balanced binary tree such as a red-black tree.
A search on PyPi throws up a couple of implementations. Searching on google will give you more.
bintrees on PyPi looks very complete and has both Python and C/Cython implementations. I have not used it though, so caveat emptor.
A red-black tree is kept sorted and most operations (insert, delete, find) are O(log2(N)), so finding an element in a tree of 200,000 entries will take on average 17-18 comparisons.
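A rough sketch of what that could look like with the bintrees package (method names as I recall them from its documentation; treat this as an untested illustration rather than a recommendation):

from bintrees import RBTree   # third-party: pip install bintrees

tree = RBTree()
for key, payload in [(5, 'a'), (1, 'b'), (9, 'c')]:
    tree.insert(key, payload)                 # O(log n); the tree stays sorted by key

smallest_key, smallest_val = tree.min_item()  # peek at the front
largest_key, largest_val = tree.max_item()    # peek at the back
front = tree.pop_min()                        # remove and return the (key, value) at the front
back = tree.pop_max()                         # remove and return the (key, value) at the back

Since the tree is keyed by whatever value you choose, sorting tuples on a particular field just means using that field as the key.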
Sounds like a skip list will fulfill all your requirements. It's basically a dynamically-sized sorted linked list, with O(log n) insertions and removals.
I don't really know Python, but this link seems to be relevant:
http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/
I presume you need it sorted because you access elements by rank in the sorted order?
You can use any implementation of any balanced binary tree, with the additional information at each node which tells you the numbers of descendants of that node (usually called the Order Statistic Binary Tree).
With this structure, given the rank of an element (even min/max), you can access/delete it in O(log n) time.
This makes all operations (access/insert/delete by rank, pop front/back, insert/delete/search by value) O(log n) time, while allowing custom sort methods.
Also, apparently python has an AVL tree (one of the first balanced tree structures) implementation which supports order statistics: http://www.python.org/ftp/python/contrib-09-Dec-1999/DataStructures/avl.README
So you won't need a custom implementation.
Except for the hashing, what you're looking for is a double-ended priority queue, aka a priority deque.
If your need for sorting doesn't extend beyond managing the min and max of your data, another structure for you to look at might be an interval heap, which has the advantage of O(1) lookup of both min and max if you need to peek at values (though deleteMin and deleteMax are still O(log(N)) ). Unfortunately, I'm not aware of any implementations in Python, so I think you'd have to roll your own.
Here's an addendum to an algorithms textbook that describes interval heaps if you're interested:
http://www.mhhe.com/engcs/compsci/sahni/enrich/c9/interval.pdf
If you can really allow O(log n) for pop, dequeue, and insert, then a simple balanced search tree like red-black tree is definitely sufficient.
You can optimize this, of course, by maintaining direct pointers to the smallest and largest elements in the tree, and updating them when you (1) insert elements into the tree or (2) pop or dequeue, which of course invalidates the respective pointer. But because the tree is balanced, there is some shuffling going on anyway, and you can update the corresponding pointer at the same time.
There is also something called a min-max heap (see the Wikipedia entry for Binary Heap), which implements exactly a "double-ended priority queue", i.e. a queue where you can pop from both the front end and the rear end. However, you can't access the whole list of objects in order there, whereas a search tree can be iterated through efficiently in O(n) time.
The benefit of a min-max heap however is that the current min and max objects can be read in O(1) time, a search tree requires O(log(n)) just to read the min or max object unless you have the cached pointers as I mentioned above.
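If you want to stay within the standard library, one workable (if less elegant) substitute for a priority deque is a pair of heaps with lazy deletion. This is not a min-max heap, just a sketch of the same interface built on heapq (the class and method names here are my own):

import heapq
import itertools

class PriorityDeque:
    # Two heaps share the same entries; an entry popped from one side is
    # remembered in _dead and silently skipped when it surfaces on the other.
    def __init__(self):
        self._min, self._max = [], []
        self._dead = set()
        self._ids = itertools.count()

    def push(self, key, item):
        uid = next(self._ids)
        heapq.heappush(self._min, (key, uid, item))
        heapq.heappush(self._max, (-key, uid, item))

    def pop_min(self):
        while True:
            key, uid, item = heapq.heappop(self._min)
            if uid not in self._dead:
                self._dead.add(uid)
                return item
            self._dead.discard(uid)   # now gone from both heaps; drop the marker

    def pop_max(self):
        while True:
            neg_key, uid, item = heapq.heappop(self._max)
            if uid not in self._dead:
                self._dead.add(uid)
                return item
            self._dead.discard(uid)

Pushes and pops are O(log n), and the key argument plays the role of the user-specified sort value for your tuples.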
If this were Java I'd use a TreeSet with the NavigableSet interface.
This is implemented as a Red-Black-Tree.