I've read that Python's lists are implemented using pointers. I then see this module http://docs.python.org/2/library/bisect.html which does efficient insertion into a sorted list. How does it do that efficiently? If the list is implemented using pointers and not a contiguous array, how can it be efficiently searched for the insertion point? And if the list is backed by a contiguous array, then elements would have to be shifted when inserting. So how does bisect work efficiently?
I believe the elements of a list are pointed at, but the "list" is really a contiguous array (in C). They're called lists, but they're not linked lists.
Actually, finding an element in a sorted list is pretty good: it's O(log n). But inserting is not that good: it's O(n).
If you need an O(log n) data structure for both, it'd be better to use a treap or a red-black tree.
It's the searching that's efficient, not the actual insertion. The fast searching makes the whole operation "adding a value and keeping all values in order" fast compared to, for example, appending and then sorting again: O(n) rather than O(n log n).
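For illustration, a minimal sketch of what the module does (the numbers are made up):

import bisect

data = [10, 20, 30, 40]
i = bisect.bisect_left(data, 25)  # binary search for the insertion point: O(log n)
data.insert(i, 25)                # shifts the trailing pointers to make room: O(n)
print(data)                       # [10, 20, 25, 30, 40]
bisect.insort_left(data, 35)      # does both steps in one call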
I'm trying to determine whether the complexity of converting a collections.deque object into a Python list object is O(n). I imagine it would have to take every element and convert it into the list, but I cannot seem to find the implementation code behind deque. Has Python built in something more efficient under the hood that could allow for O(1) conversion to a list?
Edit: Based on the following, I do not believe it could be any faster than O(n):
"Indexed access is O(1) at both ends but slows to O(n) in the middle. For fast random access, use lists instead."
If it cannot access a middle node in O(1) time it will not be able to convert without the same complexity.
You have to access every node. O(1) time is impossible for that fact alone.
I would believe that a deque follows the same principles as conventional deques, in that it's constant time to access the first element. You have to do that for n elements, so the runtime to do so would be O(n).
Here is the implementation of deque (in the CPython source it lives in Modules/_collectionsmodule.c).
However, that is irrelevant for determining the complexity of converting a deque to a list in Python.
If Python is not somehow reusing the data structure internally, conversion into a list will require a walk through the deque, and that will be O(n).
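If you want to check this empirically, here is a quick (and admittedly rough) sketch; the absolute timings are machine-dependent, but the growth should look linear:

from collections import deque
import timeit

for n in (10000, 100000, 1000000):
    d = deque(range(n))
    print(n, timeit.timeit(lambda: list(d), number=10))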
I'm trying to take a list (orig_list below), and return a list (new_list below) which:
does not contain duplicate items (i.e. contains only unique elements)
is sorted in reverse order
Here is what I have so far, which seems... I'm going to say "weird," though I'm sure there is a better way to say that. I'm mostly put off by using list() twice for what seems pretty straightforward, and then I'm wondering about the efficiency of this approach.
new_list = list(reversed(sorted(list(set(orig_list)))))
Question #1 (SO-style question):
Are the following propositions correct?
There is no more efficient way to get unique elements of a list than converting the list to a set and back.
Since sets are unordered in Python, one must (1) remove duplicate items (by converting to a set) before sorting, because otherwise you'd lose the sort anyway, and (2) convert back to a list before you sort.
Using list(reversed()) is programmatically equivalent to using list.sort(reverse=True).
Question #2 (bonus):
Are there any ways to achieve the same result in fewer operations, or using a less verbose approach? If so, what are some examples?
sorted(set(orig_list), reverse=True)
Shortest in code, more efficient, same result.
Depending on the size, it may or may not be faster to sort first then dedupe in linear time as user2864740 suggests in comments. (The biggest drawback to that approach is it would be entirely in Python, while the above line executes mostly in native code.)
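That alternative isn't spelled out in the thread, but a sketch of sort-then-dedupe might look like this (the function name is just for illustration):

def sorted_unique_desc(items):
    # sort descending first, then drop adjacent duplicates in one linear pass
    result = []
    for x in sorted(items, reverse=True):
        if not result or x != result[-1]:
            result.append(x)
    return result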
Your questions:
You do not need to convert from set to list and back. sorted accepts any iterable, so set qualifies, and spits out a list, so no post-conversion needed.
reversed(sorted(x)) is not equivalent to sorted(x, reverse=True). You get the same result, but slower: the sort takes the same time whether forward or reverse, so reversed adds an extra pass that isn't needed if you sort into the proper order from the start.
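A quick demonstration that the results match even though the work differs:

x = [3, 1, 2, 1]
a = list(reversed(sorted(x)))  # sorts ascending, then walks the result backwards
b = sorted(x, reverse=True)    # sorts descending in a single pass
assert a == b == [3, 2, 1, 1]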
You've got a few mildly wasteful steps in here, but your proposition is largely correct. The only real improvements to be made are to get rid of all the unnecessary temporary lists:
new_list = sorted(set(orig_list), reverse=True)
sorted already converts its input to a list (so no need to listify before passing to sorted), and you can have it directly produce the output list sorted in reverse (so no need to produce a list only to make a copy of it in reverse).
The only conceivable improvement on big-O time is if you know the data is already sorted, in which case you can avoid O(n log n) sorting, and uniqify without losing the existing sorted order by using itertools.groupby:
import itertools

new_list = [key for key, grp in itertools.groupby(orig_list)]
If orig_list is sorted in forward order, you can make the result of this reversed at essentially no cost by changing itertools.groupby(orig_list) to itertools.groupby(reversed(orig_list)).
The groupby solution isn't really practical for initially unsorted inputs. If duplicates are even remotely common, removing them up front via set uniquification, an O(n) step, is almost always worth it, since it reduces the n in the more costly O(n log n) sorting step. groupby is also a relatively slow tool; the implementation's use of a bunch of temporary iterators for each group, internal caching of values, etc., means it's a slower O(n) in practice than the O(n) uniquification via set. Its primary advantage is the streaming aspect: it scales to data sets streamed from disk or the network and back without storing anything for the long term, where set must pull everything into memory.
The other reason to use sorted+groupby would be if your data wasn't hashable, but was comparable; in that case, set isn't an option, so the only choice is sorting and grouping.
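A sketch of that fallback with unhashable-but-comparable items (the sample data is purely illustrative):

from itertools import groupby

rows = [[1, 2], [3, 4], [1, 2]]  # lists are unhashable, so set(rows) raises TypeError
unique_rows = [key for key, grp in groupby(sorted(rows))]
print(unique_rows)               # [[1, 2], [3, 4]]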
I am writing a program that does a lot of deletions at either the front or back of a list of data, never the middle.
I understand that deletion of the last element is cheap, but how about deletion of the first element? For example, let's say list A's address is 4000, so element 0 is at 4000 and element 1 is at 4001.
Would deleting element 0 then just make the interpreter put list A's address at 4001, or would it shift element 1 from 4001 to 4000, and shift all the other elements down by 1?
No, it isn't cheap. Removing an element from the front of the list (using list.pop(0), for example) is an O(N) operation and should be avoided. Similarly, inserting elements at the beginning (using list.insert(0, <value>)) is equally inefficient.
This is because, after the list is resized, its elements must be shifted. In CPython, the shifting in the l.pop(0) case is done with memmove, while for l.insert(0, <value>) it is implemented with a loop over the stored items.
Lists are built for fast random access and O(1) operations on their end.
Since you're doing this operation commonly, though, you should consider using a deque from the collections module (as #ayhan suggested in a comment). The docs on deque also highlight how list objects aren't suitable for these operations:
Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.
(Emphasis mine)
The deque data structure offers O(1) complexity for both sides (beginning and end) with appendleft/popleft and append/pop methods for the beginning and end respectively.
Of course, with small sizes this incurs some extra space requirements (due to the structure of the deque), which should generally be of no concern as the sizes of the lists grow (though, as #juanpa noted in a comment, this doesn't always hold). Finally, as #ShadowRanger's insightful comment notes, with really small sequence sizes the cost of popping or inserting at the front is trivialized to the point that it is of no real concern.
So, in short, for lists with many items, use deque if you need fast appends/pops from both sides, else, if you're randomly accessing and appending to the end, use lists.
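A rough benchmark of the difference (timings will vary by machine, and the gap widens as the list grows):

import timeit

t_list = timeit.timeit("l.pop(0)",
                       setup="l = list(range(100000))", number=50000)
t_deque = timeit.timeit("d.popleft()",
                        setup="from collections import deque; d = deque(range(100000))",
                        number=50000)
print(t_list, t_deque)  # pop(0) shifts every remaining pointer; popleft is O(1)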
Removing elements from the front of a list in Python is O(n), while removing elements from either end of a collections.deque is only O(1). A deque would therefore be great for your purpose; however, note that accessing or adding/removing from the middle of a deque is more costly than it is for a list.
The O(n) cost for removal is because a list in CPython is simply implemented as an array of pointers, thus your intuition regarding the shifting cost for each element is correct.
This can be seen in the Python TimeComplexity page on the Wiki.
I'm relatively new to Python (using v3.x syntax) and would appreciate notes regarding the complexity and performance of heapq vs. sorted.
I've already implemented a heapq-based solution for a greedy 'find the best job schedule' algorithm. But then I learned about the possibility of using sorted together with operator.itemgetter() and reverse=True.
Sadly, I could not find any explanation of the expected complexity and/or performance of sorted vs. heapq.
If you use a binary heap to pop all elements in order, what you are doing is basically heapsort. It is slower than the sort algorithm in the sorted function, quite apart from the fact that heapq's reference implementation is pure Python.
heapq is faster than sorted when you need to add elements on the fly, i.e., when additions and insertions can come in unspecified order. Adding a new element while preserving the heap invariant is faster than re-sorting the array after each insertion.
sorted is faster if you will need to retrieve all elements in order later.
The only problem where they really compete is if you need some portion of the smallest (or largest) elements of the collection. Although there are special algorithms for that case, whether heapq or sorted will be faster depends on the size of the initial array and the portion you'll need to extract.
The nlargest() and nsmallest() functions of heapq are most appropriate if you are trying to find a relatively small number of items. If you simply want the single smallest or largest item, min() and max() are more suitable and faster. Likewise, if N is close to the overall size of the collection, it is usually faster to sort first and take a slice. It is when N is small compared to the overall size of the collection that these functions provide superior performance. heapq may not be strictly necessary in your code, but it's an interesting topic and a worthwhile subject of study.
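For illustration:

import heapq

nums = [37, 5, 42, 19, 8, 23, 1, 64]
print(heapq.nsmallest(3, nums))  # [1, 5, 8]
print(heapq.nlargest(3, nums))   # [64, 42, 37]
print(min(nums), max(nums))      # fastest when only one extreme is needed
# when N approaches len(nums), sorted(nums)[:N] usually wins instead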
heapq is implemented as a binary heap.
The key things to note about binary heaps, and by extension heapq (a short usage sketch follows this list):
Searching is not supported
Insertions are constant time on average
Deletions are O(log n) time on average
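A minimal usage sketch of those operations:

import heapq

heap = []
for x in [5, 1, 4, 2]:
    heapq.heappush(heap, x)    # O(log n) worst case per push
while heap:
    print(heapq.heappop(heap)) # always pops the smallest: 1, 2, 4, 5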
Additional binary heap info described here: http://en.wikipedia.org/wiki/Binary_heap
While heapq is a data structure which has the properties of a binary heap, using sorted is a different concept. sorted returns a sorted list, so that's essentially a result, whereas the heapq is a data structure you are continually working with, which could, optionally, be sorted via sorted.
Additional sorted info here: https://docs.python.org/3.4/library/functions.html#sorted
What specifically are you trying to accomplish?
Response to OP's comment:
Why do you think you need a heapq specifically? A binary heap is a specialized data structure, and depending on your requirements, it's quite likely not necessary.
You seem to be extremely concerned about performance, but it's not clear why. If something is a "bad performer" but its aggregate time is not significant, then it really doesn't matter in the bigger picture. In the aggregate case, a dict or a list would generally perform just fine. Why do you specifically think a heapq is needed?
I wonder if this is a don't-let-the-perfect-be-the-enemy-of-the-good type of situation.
Writing Python using C extensions is a niche use case reserved for situations where performance is truly a significant issue. (For example, it may be better to use an XML parser that is a C extension rather than one that is pure Python if you're dealing with large files and performance is your main concern.)
Regarding the "in complex cases, keep playing with the structure" comment (could it be faster to sort with sorted and add elements via .append()?):
I'm still not clear what the use case is here. As I mentioned above, sorted and heapq are really two different concepts.
What is the use case for which you are so concerned about performance? (Absent other factors not yet specified, I think you may be overly emphasizing the importance of best-case performance in your code here.)
If I have a long unsorted list of 300k elements, will sorting the list first and then doing a for loop over it speed up my code? I need to do a for loop regardless; I can't use a list comprehension.
# myList stands in for your original list.
# sorted() returns a new list; myList.sort() would sort in place and return None.
sortedL = sorted(myList)
for i in sortedL:
    if i == somenumber:
        pass  # do some work
How could I signal to Python that sortedL is sorted, so it doesn't have to read the whole list? Is there any benefit to sorting a list? If there is, how can I implement it?
It would appear that you're considering sorting the list so that you could then quickly look for somenumber.
Whether the sorting will be worth it depends on whether you are going to search once, or repeatedly:
If you're only searching once, sorting the list will not speed things up. Just iterate over the list looking for the element, and you're done.
If, on the other hand, you need to search for values repeatedly, by all means pre-sort the list. This will enable you to use bisect to quickly look up values (see the sketch after this list).
The third option is to store elements in a dict. This might offer the fastest lookups, but will probably be less memory-efficient than using a list.
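A sketch of the bisect-based lookup from option two (the helper name is just for illustration):

import bisect

def contains(sorted_list, x):
    # binary search: O(log n) per lookup once the list is sorted
    i = bisect.bisect_left(sorted_list, x)
    return i != len(sorted_list) and sorted_list[i] == x

data = sorted([37, 5, 42, 19, 8])
print(contains(data, 19))  # True
print(contains(data, 20))  # False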
The cost of a for loop in python is not dependent on whether the input data is sorted.
That being said, you might be able to break out of the for loop early, or make other computation-saving choices at the algorithm level, if you sort first.
If you want to search within a sorted list, you need an algorithm that takes advantage of the sorting.
One possibility is the built-in bisect module. This is a bit of a pain to use, but there's a recipe in the documentation for building simple sorted-list functions on top of it.
With that recipe, you can just write this:
i = index(sortedL, somenumber)
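For reference, the recipe boils down to something like this (adapted from the bisect docs):

import bisect

def index(a, x):
    # locate the leftmost value exactly equal to x
    i = bisect.bisect_left(a, x)
    if i != len(a) and a[i] == x:
        return i
    raise ValueError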
Of course if you're just sorting for the purposes of speeding up a single search, this is a bit silly. Sorting will take O(N log N) time, then searching will take O(log N), for a total time of O(N log N); just doing a linear search will take O(N) time. So, unless you're typically doing log N searches on the same list, this isn't worth doing.
If you don't actually need sorting, just fast lookups, you can use a set instead of a list. This gives you O(1) lookup for all but pathological cases.
Also, if you want to keep a list sorted while continuing to add/remove/etc., consider using something like blist.sortedlist instead of a plain list.