Complexity for set - python

This was an interview question which I got wrong, and I am very confused by it.
fruits = {"apples", "grapes", "oranges", "pears"}
for fruit in fruits:
    print(fruit)
My thinking was that we do an O(1) access n times, so the time complexity is O(n). However, they said that I am incorrect and that the answer is O(1). It was multiple-choice, so I did not get feedback. Can anyone explain it?

If you assume the fruits set has a constant number of members (4) in every case, then the loop takes constant time every time you run it. ;-)

The complexity is O(1) in most situations.
Explanation:
Lookup/Insert/Delete have O(1) complexity in the average case. The reason is that they make use of a hash table.
How does the hash table do it in O(1)?
Whenever you try to insert a new value into a set, the value goes through a hash function and its position is decided by that hash function. So, calculating that position is O(1), and inserting it at a known position is O(1).
Now, when you wish to find a value x, you just need to know the hash function to know the position of x. You need not loop through all the values (which would make it O(n)).
When is lookup O(n)?
As previously explained, you take a value, pass it through the hash function and get the position where it should be placed. Now imagine a hash function that, in some cases, places different values at the same location. When you try to look a value up, you know the position, but you still have to do a linear search among all the elements stored there to find the one you need. In the worst-case "hypothetical" scenario, the hash function places all the elements in a single position, so the lookup again becomes a linear search among n elements, which makes its complexity O(n).
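To make that concrete, here is a toy sketch of a chained hash table (not CPython's real implementation, which uses open addressing); the table size and helper names are made up for illustration:

# Toy chained hash table; illustrative only, not how CPython implements set.
table_size = 8
buckets = [[] for _ in range(table_size)]

def insert(value):
    i = hash(value) % table_size    # O(1): compute the bucket index
    if value not in buckets[i]:
        buckets[i].append(value)    # O(1) on average: the bucket stays small

def lookup(value):
    i = hash(value) % table_size    # O(1): same bucket index
    return value in buckets[i]      # degrades to O(n) only if (almost) everything collided here

for fruit in ("apples", "grapes", "oranges", "pears"):
    insert(fruit)

print(lookup("grapes"))  # True
print(lookup("mango"))   # False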
Let me know in the comments if you have any doubts.

Related

Is there a way to find the minimum element in a set in less than O(n) time?

I am trying to get the minimum value from a set in Python by using min().
Constraints:
I am using a dictionary which maps a node to a value, but this dictionary is constantly being updated.
I am constantly reducing the size of the set.
My set has 510650 elements.
current = min(unvisited, key=lambda node: distanceTo[node])
The name of the dictionary is 'distanceTo'.
The name of the set is 'unvisited'.
If I understand big-o notation correctly, I don't believe so.
"O(n) means that processing time increases linearly with the size of the input"
If you consider it logically, for each item you add to an unsorted list, that's one more element you need to compare to see if it meets your specific requirement (in your case, having the least value).
https://en.wikipedia.org/wiki/Time_complexity#Linear_time
Edit:
Of course, if someone simply wanted to find a specific value in a list, and that value happened to be towards the beginning of where they started searching, it would be found in fewer than n steps; probably not relevant, but I felt it was worth mentioning.
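As a quick illustration with toy data (distanceTo and unvisited are the names from the question; the values are invented), the min() call still has to visit every element of the unsorted set:

# Toy stand-ins for the question's 510650-element set and its distance map.
distanceTo = {"A": 7, "B": 2, "C": 5}
unvisited = {"A", "B", "C"}

# min() scans every element of the set, so each call is O(n).
current = min(unvisited, key=lambda node: distanceTo[node])
print(current)  # "B"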

List vs Dictionary vs Set for finding the number of times a number appears

What data structure - Set, List or Dictionary - in Python will you use to store 10 million integers? Query operations consist of finding the number of times a number appears in the given set. What will be the worst case time and space complexity in all the cases?
This is a sample interview question. What would be the most appropriate answer?
The key to this question is the line that states:
"finding the number of times a number appears in a given set"
The set data structure is going to be incapable of keeping a count of how many times a number appears within the total dataset, and a List is going to be extremely costly to iterate over, which leaves a Dictionary as the only viable option.
Breaking down the options:
Set:
A set automatically de-duplicates: adding a value that is already present has no effect. So it would be impossible to query the frequency with which a number appears within the stored dataset using a set, because the answer for every stored number would be 1.
Time complexity for querying: O(1)
Space complexity for storing: O(n)
List:
A list could be iterated over to determine the frequency of a given number within it. However, this is going to be an O(n) operation, and for 10 million integers it will not be efficient.
Time complexity for querying: O(n)
Space complexity for storing: O(n)
Dictionary:
A dictionary allows you to store a key-value pair. In this case, you would store the number to be searched as the key, and the count of how many times it has been stored as the associated value. Because of the way that dictionaries hash keys into distinct buckets (there can be collisions, but let's assume a non-colliding theoretical dictionary for now), the lookup time for a given key approaches O(1). Building the counts, however, is going to slow down a Dictionary; it will take O(n) time to calculate the counts for all keys, because each input number has to be visited at least once in order to add it to the running count stored in the value.
Time complexity for querying: O(1)
Time complexity for storing: O(n)
Space complexity for storing: O(2n) = O(n)
Adding to the answer of @John Stark:
From the Python wiki, the worst-case time complexity for querying a set is O(n). This is because the set uses a hash to locate the value, but with (a LOT of) bad luck you might have a hash collision for every key. In the vast majority of cases, however, you won't have collisions.
Also, because the keys here are integers, you reduce hash collisions if the range of the integers is limited. In Python 2, the hash of a plain int is the int itself, so two distinct int values never produce the same hash (although they can still land in the same bucket after the table-size modulo).
Add every number to the dict as number: 1 if the number is not already in the dict; otherwise add 1 to the value of that key.
Then search for a specific number as a key; its value will be the number of times that number appears.
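A minimal sketch of that approach with toy data (a small random list stands in for the 10 million integers):

import random

numbers = [random.randint(0, 99) for _ in range(10_000)]  # stand-in for the 10 million integers

counts = {}
for number in numbers:
    if number not in counts:
        counts[number] = 1       # first occurrence
    else:
        counts[number] += 1      # seen before: bump the count

query = 42
print(counts.get(query, 0))      # how many times `query` appears; an O(1) lookup

In practice, collections.Counter(numbers) builds the same mapping in one line.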

Optimizing Python Dictionary Lookup Speeds by Shortening Key Size?

I'm not clear on what goes on behind the scenes of a dictionary lookup. Does key size factor into the speed of lookup for that key?
Current dictionary keys are between 10 and 20 characters long, alphanumeric.
I need to do hundreds of lookups a minute.
If I replace those with smaller key IDs of between 1 and 4 digits, will I get faster lookup times? This would mean I would need to add another value to each item the dictionary holds. Overall, the dictionary would be larger.
Also I'll need to change the program to lookup the ID then get the URL associated with the ID.
Am I likely just adding complexity to the program with little benefit?
Dictionaries are hash tables, so looking up a key consists of:
Hash the key.
Reduce the hash to the table size.
Index the table with the result.
Compare the looked-up key with the input key.
Normally, this is amortized constant time, and you don't care about anything more than that. There are two potential issues, but they don't come up often.
Hashing the key takes linear time in the length of the key. For, e.g., huge strings, this could be a problem. However, if you look at the source code for most of the important types, including [str/unicode](https://hg.python.org/cpython/file/default/Objects/unicodeobject.c), you'll see that they cache the hash the first time. So, unless you're inputting (or randomly creating, or whatever) a bunch of strings to look up once and then throw away, this is unlikely to be an issue in most real-life programs.
On top of that, 20 characters is really pretty short; you can probably do millions of such hashes per second, not hundreds.
From a quick test on my computer, hashing 20 random letters takes 973ns, hashing a 4-digit number takes 94ns, and hashing a value I've already hashed takes 77ns. Yes, that's nanoseconds.
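A rough way to reproduce that kind of measurement (the numbers are machine- and version-dependent, so treat this as illustrative rather than a rigorous benchmark):

import timeit

# Hashing a freshly built 20-character string each time (includes the formatting cost).
fresh = timeit.timeit("hash('%020d' % i)", setup="i = 12345", number=1_000_000)

# Hashing the same string object repeatedly: str caches its hash after the first call.
cached = timeit.timeit("hash(s)", setup="s = 'a' * 20", number=1_000_000)

print(f"fresh string each time: {fresh:.3f}s per 1M hashes")
print(f"cached hash:            {cached:.3f}s per 1M hashes")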
Meanwhile, "Index the table with the result" is a bit of a cheat. What happens if two different keys hash to the same index? Then "compare the looked-up key" will fail, and… what happens next? CPython's implementation uses probing for this. The exact algorithm is explained pretty nicely in the source. But you'll notice that given really pathological data, you could end up doing a linear search for every single element. This is never going to come up—unless someone can attack your program by explicitly crafting pathological data, in which case it will definitely come up.
Switching from 20-character strings to 4-digit numbers wouldn't help here either. If I'm crafting keys to DoS your system via dictionary collisions, I don't care what your actual keys look like, just what they hash to.
More generally, premature optimization is the root of all evil. This is sometimes misquoted to overstate the point; Knuth was arguing that the most important thing to do is find the 3% of the cases where optimization is important, not that optimization is always a waste of time. But either way, the point is: if you don't know in advance where your program is too slow (and if you think you know in advance, you're usually wrong…), profile it, and then find the part where you get the most bang for your buck. Optimizing one arbitrary piece of your code is likely to have no measurable effect at all.
Python dictionaries are implemented as hash maps behind the scenes. The key length might have some impact on performance if, for example, the hash function's complexity depends on the key length, but in general the performance impact will be negligible.
So I'd say there is little to no benefit for the added complexity.

Can you speed up a "for" loop in Python with sorting?

If I have a long unsorted list of 300k elements, will sorting this list first and then doing a "for" loop on the list speed up the code? I need to do a "for" loop regardless; I can't use a list comprehension.
sortedL = sorted(my_list)   # note: my_list.sort() sorts in place and returns None, so use sorted()
for i in sortedL:
    if i == somenumber:
        pass  # do some work
How can I signal to Python that sortedL is sorted so that it doesn't have to read the whole list? Is there any benefit to sorting the list? If there is, how can I implement it?
It would appear that you're considering sorting the list so that you could then quickly look for somenumber.
Whether the sorting will be worth it depends on whether you are going to search once, or repeatedly:
If you're only searching once, sorting the list will not speed things up. Just iterate over the list looking for the element, and you're done.
If, on the other hand, you need to search for values repeatedly, by all means pre-sort the list. This will enable you to use bisect to quickly look up values.
The third option is to store elements in a dict. This might offer the fastest lookups, but will probably be less memory-efficient than using a list.
The cost of a for loop in Python does not depend on whether the input data is sorted.
That being said, you might be able to break out of the for loop early or have other computation saving things at the algorithm level if you sort first.
If you want to search within a sorted list, you need an algorithm that takes advantage of the sorting.
One possibility is the built-in bisect module. This is a bit of a pain to use, but there's a recipe in the documentation for building simple sorted-list functions on top of it.
With that recipe, you can just write this:
i = index(sortedL, somenumber)
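That recipe looks roughly like this (adapted from the searching-sorted-lists examples in the bisect documentation):

from bisect import bisect_left

def index(a, x):
    # Locate the leftmost value in the sorted list a exactly equal to x.
    i = bisect_left(a, x)
    if i != len(a) and a[i] == x:
        return i
    raise ValueError(f"{x!r} is not in the list")

sortedL = sorted([42, 7, 19, 7, 3])   # [3, 7, 7, 19, 42]
print(index(sortedL, 19))             # 3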
Of course, if you're just sorting for the purposes of speeding up a single search, this is a bit silly. Sorting will take O(N log N) time, then searching will take O(log N), for a total of O(N log N); just doing a linear search would take O(N) time. So, unless you're doing on the order of log N or more searches on the same list, this isn't worth doing.
If you don't actually need sorting, just fast lookups, you can use a set instead of a list. This gives you O(1) lookup for all but pathological cases.
Also, if you want to keep a list sorted while continuing to add/remove/etc., consider using something like blist.sortedlist instead of a plain list.

What makes sets faster than lists?

The python wiki says: "Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple."
I've been using sets in place of lists whenever speed is important in my code, but lately I've been wondering why sets are so much faster than lists. Could anyone explain, or point me to a source that would explain, what exactly is going on behind the scenes in python to make sets faster?
list: Imagine you are looking for your socks in your closet, but you don't know in which drawer your socks are, so you have to search drawer by drawer until you find them (or maybe you never do). That's what we call O(n), because in the worst scenario, you will look in all your drawers (where n is the number of drawers).
set: Now, imagine you're still looking for your socks in your closet, but now you know in which drawer your socks are, say in the 3rd drawer. So, you will just search in the 3rd drawer, instead of searching in all drawers. That's what we call O(1), because in the worst scenario you will look in just one drawer.
Sets are implemented using hash tables. Whenever you add an object to a set, the position within the memory of the set object is determined using the hash of the object to be added. When testing for membership, all that needs to be done is basically to look if the object is at the position determined by its hash, so the speed of this operation does not depend on the size of the set. For lists, in contrast, the whole list needs to be searched, which will become slower as the list grows.
This is also the reason that sets do not preserve the order of the objects you add.
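A quick, machine-dependent illustration is to time membership tests on the same data stored both ways:

import timeit

setup = """
data_list = list(range(100_000))
data_set = set(data_list)
missing = -1   # worst case for the list: it has to scan every element
"""

list_time = timeit.timeit("missing in data_list", setup=setup, number=1_000)
set_time = timeit.timeit("missing in data_set", setup=setup, number=1_000)

print(f"list membership: {list_time:.4f}s")
print(f"set membership:  {set_time:.4f}s")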
Note that sets aren't faster than lists in general -- membership test is faster for sets, and so is removing an element. As long as you don't need these operations, lists are often faster.
I think you need to take a good look at a book on data structures. Basically, Python lists are implemented as dynamic arrays and sets are implemented as hash tables.
The implementation of these data structures gives them radically different characteristics. For instance, a hash table has a very fast lookup time but cannot preserve the order of insertion.
While I have not measured anything performance-related in Python so far, I'd still like to point out that lists are often faster.
Yes, you have O(1) vs. O(n), but always remember that this only describes asymptotic behavior. That means that for sufficiently large n, O(1) will always be faster - theoretically. In practice, however, n often needs to be much bigger than your usual data set before that holds.
So sets are not faster than lists per se, but only when you have to handle a lot of elements.
Python uses hashtables, which have O(1) lookup.
Basically, it depends on the operation you are doing…
*For adding an element - a set doesn't need to move any data; all it needs to do is calculate a hash value and add the element to a table. For a list insertion, there is potentially data to be moved.
*For deleting an element - all a set needs to do is remove the hash entry from the hash table; a list potentially needs to move data around (on average, half of the data).
*For a search (i.e. an in operator) - a set just needs to calculate the hash value of the data item, find that hash value in the hash table, and if it is there - bingo. For a list, the search has to look at each item in turn - on average half of all the items in the list. Even for many thousands of items, a set will be far quicker to search.
Actually, sets are not faster than lists in every scenario; for many operations lists are faster than sets. But in the case of searching for an element in a collection, sets are faster, because sets are implemented using hash tables: Python does not have to search the full set, so the average time complexity is O(1). Lists use dynamic arrays, and Python needs to check the full array to search, which takes O(n).
So we can see that sets are better in some cases and lists are better in others. It's up to us to select the appropriate data structure for the task.
A list must be searched one element at a time, whereas a set or dictionary uses a hash table as an index for faster searching.
