I was studying hash tables and a thought came to me:
Why not use dictionaries for searching an element instead of first sorting the list and then doing binary search? (Assume that I want to search multiple times.)
We can convert a list to a dictionary in O(n) time (I think), because we have to go through all the elements.
Adding each of those elements to the dictionary takes O(1) time on average.
When the dictionary is ready, we can then search for any element in O(1) time on average, with O(n) as the worst case.
Now if we talk about the average case, O(n) is better than sorting first, because even the best comparison sorts take O(n log n). And if I am right about all of this, then why not do it this way?
I know there are various other things you can do with sorted elements that cannot be done with an unsorted dictionary or array. But if we stick only to searching, isn't this a better way to search than sorting first?
Right, a well-designed hash table can beat sorting and searching.
For a proper choice, many factors come into play, such as an in-place requirement, how dynamic the data set is, the ratio of searches to insertions/deletions, and how easy it is to build an effective hash function...
Binary search is a searching technique that exploits the fact that the list of keys in which a key is to be searched is already sorted; it doesn't require you to sort and then search, which makes its worst-case search time O(log n).
If you do not have a sorted list of keys and want to search for a key once, you will have to do a linear search, which in the worst case runs in O(n). There is no point in sorting and then searching for a single lookup, since that is definitely slower: the best known comparison-based sorting algorithms take O(n log n) time.
Building a dictionary from a list of keys and then performing a single lookup is of no advantage here, because a linear search yields the same or better performance, and the dictionary also needs auxiliary memory. However, if you have multiple lookups and the keys fit comfortably in memory, using a dictionary can be an advantage: building it is a one-time O(n) cost, and each subsequent lookup is O(1) on average, at the expense of the extra memory the dictionary uses.
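A minimal sketch of that trade-off, with made-up values: build the dictionary once, then every lookup is an average-O(1) hash probe rather than an O(n) scan.

```python
# one-time O(n) build: map each value to its position in the list
values = [4, 8, 15, 16, 23, 42]
index = {v: i for i, v in enumerate(values)}

# each subsequent lookup is O(1) on average
print(index[23])    # 4
print(99 in index)  # False
```

The break-even point depends on how many lookups follow the build; for a single search, the plain linear scan wins.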
Related
Apart from binary search, do we have any other algorithm with a lower number of comparisons?
Further, binary search only works on a sorted list. What if the elements are unsorted?
If the number of elements (= n) is big, the run time would be high if I opt to sort it and then run a binary search on it.
Is there any other alternative?
Sorting has a cost of O(n log n) in the average case if you use Timsort, Python's default sorting algorithm, so it's only worth sorting if you are going to perform many searches and the array is not going to receive new elements, since keeping it in order after each insertion costs O(n).
On the other hand, since you have to look at every value individually, I don't think there is a better way for a single search, unless you use parallel programming; that way several threads could examine different values at the same time.
I'm trying to take a list (orig_list below), and return a list (new_list below) which:
does not contain duplicate items (i.e. contains only unique elements)
is sorted in reverse order
Here is what I have so far, which seems... I'm going to say "weird," though I'm sure there is a better way to say that. I'm mostly put off by using list() twice for what seems pretty straightforward, and then I'm wondering about the efficiency of this approach.
new_list = list(reversed(sorted(list(set(orig_list)))))
Question #1 (SO-style question):
Are the following propositions correct?
There is no more efficient way to get unique elements of a list than converting the list to a set and back.
Since sets are unordered in Python one must (1) convert to a set before removing duplicate items because otherwise you'd lose the sort anyway, and (2) you have to convert back to a list before you sort.
Using list(reversed(sorted(...))) is programmatically equivalent to using sorted(..., reverse=True).
Question #2 (bonus):
Are there any ways to achieve the same result in fewer operations, or using a less verbose approach? If so, what is an / are some example(s)?
sorted(set(orig_list), reverse=True)
Shortest in code, more efficient, same result.
Depending on the size, it may or may not be faster to sort first then dedupe in linear time as user2864740 suggests in comments. (The biggest drawback to that approach is it would be entirely in Python, while the above line executes mostly in native code.)
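A rough sketch of that sort-first-then-dedupe alternative (sample values are made up): after sorting, duplicates are adjacent, so one linear pass removes them.

```python
orig_list = [3, 1, 3, 2, 1]

# sort first (duplicates end up adjacent), then dedupe in one O(n) pass
s = sorted(orig_list, reverse=True)
new_list = [v for i, v in enumerate(s) if i == 0 or v != s[i - 1]]
print(new_list)  # [3, 2, 1]
```

As noted, the listcomp runs in Python bytecode, whereas `sorted(set(...))` spends most of its time in native code, so measure before preferring it.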
Your questions:
You do not need to convert from set to list and back. sorted accepts any iterable, so set qualifies, and spits out a list, so no post-conversion needed.
reversed(sorted(x)) is not equivalent to sorted(x, reverse=True). You get the same result, but more slowly: sorting is the same speed forward or reverse, so reversed adds an extra operation that is not needed if you sort into the proper order from the start.
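A quick check with made-up data that the two spellings produce the same result (the single-pass version just skips the extra reversal step):

```python
data = [3, 1, 4, 1, 5]

a = list(reversed(sorted(data)))   # sort, then an extra reversing pass
b = sorted(data, reverse=True)     # one pass, already in the right order

print(a == b, a)  # True [5, 4, 3, 1, 1]
```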
You've got a few mildly wasteful steps in here, but your proposition is largely correct. The only real improvements to be made are to get rid of all the unnecessary temporary lists:
new_list = sorted(set(orig_list), reverse=True)
sorted already converts its input to a list (so no need to listify before passing to sorted), and you can have it directly produce the output list sorted in reverse (so no need to produce a list only to make a copy of it in reverse).
The only conceivable improvement on big-O time is if you know the data is already sorted, in which case you can avoid O(n log n) sorting, and uniqify without losing the existing sorted order by using itertools.groupby:
new_list = [key for key, grp in itertools.groupby(orig_list)]
If orig_list is sorted in forward order, you can make the result of this reversed at essentially no cost by changing itertools.groupby(orig_list) to itertools.groupby(reversed(orig_list)).
The groupby solution isn't really practical for initially unsorted inputs. If duplicates are even remotely common, removing them via O(n) set uniquification is almost always worth it first, since it reduces the n in the more costly O(n log n) sorting step. groupby is also a relatively slow tool; the implementation's use of temporary iterators for each group, internal caching of values, etc. makes it a slower O(n) in practice than O(n) uniquification via set. Its primary advantage is streaming: it scales to data sets streamed from disk or the network and back without storing anything for the long term, where set must pull everything into memory.
The other reason to use sorted+groupby would be if your data wasn't hashable, but was comparable; in that case, set isn't an option, so the only choice is sorting and grouping.
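A small sketch of that unhashable-but-comparable case, using made-up list-of-lists data: set() would raise TypeError here, but sorting plus grouping still works.

```python
from itertools import groupby

# lists are not hashable, so set(rows) would raise TypeError
rows = [[1, 2], [0, 5], [1, 2]]

# sort to make duplicates adjacent, then keep one key per group
unique_sorted = [key for key, _ in groupby(sorted(rows))]
print(unique_sorted)  # [[0, 5], [1, 2]]
```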
I'm relatively new to python (using v3.x syntax) and would appreciate notes regarding complexity and performance of heapq vs. sorted.
I've already implemented a heapq based solution for a greedy 'find the best job schedule' algorithm. But then I've learned about the possibility of using 'sorted' together with operator.itemgetter() and reverse=True.
Sadly, I could not find any explanation on expected complexity and/or performance of 'sorted' vs. heapq.
If you use a binary heap to pop all elements in order, what you are doing is basically heapsort. It is slower than the sort algorithm in the sorted function, apart from the fact that heapq's implementation is pure Python.
heapq is faster than sorted when you need to add elements on the fly, i.e. when additions and lookups could come interleaved in unspecified order. Adding a new element while preserving the inner order of a heap is faster than re-sorting the array after each insertion.
sorted is faster if you will need to retrieve all elements in order later.
The only problem where they can compete is when you need some portion of the smallest (or largest) elements of a collection. Although there are special algorithms for that case, whether heapq or sorted is faster here depends on the size of the initial array and the size of the portion you need to extract.
The nlargest() and nsmallest() functions of heapq are most appropriate if you are trying to find a relatively small number of items. If you simply want the single smallest or largest number, min() and max() are most suitable, because they are faster; and if N is close to the size of the collection itself, it is usually faster to use sorted() and then slicing. The heapq functions provide superior performance when N is small compared to the overall size of the collection. Although it's not necessary to use heapq in your code, it's just an interesting topic and a worthwhile subject of study.
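A quick illustration with a made-up list, showing the small-N case where the heapq helpers shine:

```python
import heapq

nums = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]

# N = 3 is small relative to len(nums), so these beat full sorting
print(heapq.nsmallest(3, nums))  # [-4, 1, 2]
print(heapq.nlargest(3, nums))   # [42, 37, 23]
```

For N = 1, `min(nums)` / `max(nums)` would be the right tools instead.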
heapq is implemented as a binary heap,
The key things to note about binary heaps, and by extension, heapq:
Searching is not supported
Insertions are constant time on average
Deletions are O(log n) time on average
Additional binary heap info described here: http://en.wikipedia.org/wiki/Binary_heap
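The points above can be sketched in a few lines (values are made up; only push and pop are shown, since searching isn't supported):

```python
import heapq

heap = []
for value in [5, 1, 8, 3]:      # insertions arrive in no particular order
    heapq.heappush(heap, value)  # O(log n) worst case per push

# the smallest element is always at the root, heap[0]
print(heapq.heappop(heap))  # 1
print(heapq.heappop(heap))  # 3
```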
While heapq is a data structure which has the properties of a binary heap, using sorted is a different concept. sorted returns a sorted list, so that's essentially a result, whereas the heapq is a data structure you are continually working with, which could, optionally, be sorted via sorted.
Additional sorted info here: https://docs.python.org/3.4/library/functions.html#sorted
What specifically are you trying to accomplish?
Response to OP's comment:
Why do you think you need a heapq specifically? A binary heap is a specialized data structure, and depending on your requirements, it's quite likely not necessary.
You seem to be extremely concerned about performance, but it's not clear why. If something is a "bad performer" but its aggregate time is not significant, then it really doesn't matter in the bigger picture. In the aggregate case, a dict or a list would generally perform fine. Why do you specifically think a heapq is needed?
I wonder if this is a don't-let-the-perfect-be-the-enemy-of-the-good type of situation.
Writing Python using C extensions is a niche use case reserved for cases where performance is truly a significant issue. (i.e. it may be better to use, say, an XML parser that is a C extension than something that is pure Python if you're dealing with large files and if performance is your main concern).
Regarding the "in complex cases, keep playing with the structure" question: could it be faster to sort with sorted and add elements via .append()?
I'm still not clear what the use case is here. As I mentioned above, sorted and heapq are really two different concepts.
What is the use case for which you are so concerned about performance? (Absent other factors not yet specified, I think you may be overly emphasizing the importance of best-case performance in your code here.)
If I have a long unsorted list of 300k elements, will sorting this list first and then doing a "for" loop on the list speed up my code? I need to do a "for" loop regardless; I can't use a list comprehension.
sortedL = sorted(L)  # note: L.sort() sorts in place and returns None
for i in sortedL:
    if i == somenumber:
        "do some work"
How could I signal to Python that sortedL is sorted, so it doesn't read the whole list? Is there any benefit to sorting the list? If there is, how can I implement it?
It would appear that you're considering sorting the list so that you could then quickly look for somenumber.
Whether the sorting will be worth it depends on whether you are going to search once, or repeatedly:
If you're only searching once, sorting the list will not speed things up. Just iterate over the list looking for the element, and you're done.
If, on the other hand, you need to search for values repeatedly, by all means pre-sort the list. This will enable you to use bisect to quickly look up values.
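A minimal sketch of that bisect lookup on a made-up pre-sorted list: bisect_left finds the insertion point in O(log n), and a membership test is one extra comparison.

```python
import bisect

sortedL = [2, 4, 4, 7, 9]          # must already be sorted
i = bisect.bisect_left(sortedL, 7)  # O(log n) binary search
found = i < len(sortedL) and sortedL[i] == 7
print(i, found)  # 3 True
```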
The third option is to store elements in a dict. This might offer the fastest lookups, but will probably be less memory-efficient than using a list.
The cost of a for loop in python is not dependent on whether the input data is sorted.
That being said, you might be able to break out of the for loop early or have other computation saving things at the algorithm level if you sort first.
If you want to search within a sorted list, you need an algorithm that takes advantage of the sorting.
One possibility is the built-in bisect module. This is a bit of a pain to use, but there's a recipe in the documentation for building simple sorted-list functions on top of it.
With that recipe, you can just write this:
i = index(sortedL, somenumber)
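That `index` helper is roughly the recipe from the bisect docs, which wraps bisect_left in an exact-match check:

```python
from bisect import bisect_left

def index(a, x):
    """Locate the leftmost value exactly equal to x; raise ValueError if absent."""
    i = bisect_left(a, x)
    if i != len(a) and a[i] == x:
        return i
    raise ValueError

print(index([1, 3, 5, 7], 5))  # 2
```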
Of course if you're just sorting for the purposes of speeding up a single search, this is a bit silly. Sorting will take O(N log N) time, then searching will take O(log N), for a total time of O(N log N); just doing a linear search will take O(N) time. So, unless you're typically doing log N searches on the same list, this isn't worth doing.
If you don't actually need sorting, just fast lookups, you can use a set instead of a list. This gives you O(1) lookup for all but pathological cases.
Also, if you want to keep a list sorted while continuing to add/remove/etc., consider using something like blist.sortedlist instead of a plain list.
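If pulling in a third-party package isn't an option, the stdlib's bisect.insort keeps a plain list sorted on each insert (it still shifts elements, so each insertion is O(n)); a small made-up example:

```python
import bisect

sortedL = [1, 3, 5]
bisect.insort(sortedL, 4)  # finds the slot in O(log n), shifts in O(n)
print(sortedL)  # [1, 3, 4, 5]
```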
The python wiki says: "Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple."
I've been using sets in place of lists whenever speed is important in my code, but lately I've been wondering why sets are so much faster than lists. Could anyone explain, or point me to a source that would explain, what exactly is going on behind the scenes in python to make sets faster?
list: Imagine you are looking for your socks in your closet, but you don't know in which drawer your socks are, so you have to search drawer by drawer until you find them (or maybe you never do). That's what we call O(n), because in the worst scenario, you will look in all your drawers (where n is the number of drawers).
set: Now, imagine you're still looking for your socks in your closet, but now you know in which drawer your socks are, say in the 3rd drawer. So, you will just search in the 3rd drawer, instead of searching in all drawers. That's what we call O(1), because in the worst scenario you will look in just one drawer.
Sets are implemented using hash tables. Whenever you add an object to a set, the position within the memory of the set object is determined using the hash of the object to be added. When testing for membership, all that needs to be done is basically to look if the object is at the position determined by its hash, so the speed of this operation does not depend on the size of the set. For lists, in contrast, the whole list needs to be searched, which will become slower as the list grows.
This is also the reason that sets do not preserve the order of the objects you add.
Note that sets aren't faster than lists in general -- membership test is faster for sets, and so is removing an element. As long as you don't need these operations, lists are often faster.
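A rough way to see the membership-test gap yourself (sizes and repeat counts are arbitrary; absolute timings will vary by machine):

```python
import timeit

n = 10_000
as_list = list(range(n))
as_set = set(as_list)

# worst case for the list: the value we probe sits at the very end
t_list = timeit.timeit(lambda: (n - 1) in as_list, number=1_000)
t_set = timeit.timeit(lambda: (n - 1) in as_set, number=1_000)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

The set time barely changes as n grows, while the list time grows linearly.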
I think you need to take a good look at a book on data structures. Basically, Python lists are implemented as dynamic arrays and sets are implemented as hash tables.
The implementation of these data structures gives them radically different characteristics. For instance, a hash table has a very fast lookup time but cannot preserve the order of insertion.
While I have not measured anything performance related in python so far, I'd still like to point out that lists are often faster.
Yes, you have O(1) vs. O(n). But always remember that this only describes asymptotic behavior. That means for a sufficiently large n, O(1) will always be faster, theoretically. In practice, however, n often needs to be much bigger than your usual data set for that to matter.
So sets are not faster than lists per se, but only if you have to handle a lot of elements.
Python's sets and dicts use hash tables, which have O(1) average-case lookup.
Basically, it depends on the operation you are doing…
* For adding an element: a set doesn't need to move any data; all it needs to do is calculate a hash value and add the item to a table. For a list insertion, there is potentially data to be moved.
* For deleting an element: all a set needs to do is remove the hash entry from the hash table; a list potentially needs to move data around (on average half of the data).
* For a search (i.e. the in operator): a set just calculates the hash value of the item, looks that hash up in the hash table, and if it is there, bingo. For a list, the search has to examine each item in turn, on average half of all the items in the list. Even for many thousands of items, a set will be far quicker to search.
Actually, sets are not faster than lists in every scenario. Generally, lists are faster than sets. But in the case of searching for an element in a collection, sets are faster, because sets are implemented using hash tables. So Python does not have to search the full set, which means the average time complexity is O(1). Lists use dynamic arrays, and Python needs to scan the full array to search, which takes O(n).
So we can see that sets are better in some cases and lists are better in others. It's up to us to select the appropriate data structure for the task.
A list must be searched one element at a time, whereas a set or dictionary has a hash index for faster searching.