Time complexity of python "set.intersection" for n sets - python

I want to know the complexity of the set.intersection of python. I looked in the documentations and the online wikis for python, but I did not find the time complexity of this method for multiple sets.

The python wiki on time complexity lists a single intersection as O(min(len(s), len(t)) where s and t are the sizes of the sets and t is a set. (In English: the time is bounded by and linear in the size of the smaller set.)
Note: based on the comments below, this wiki entry had been be wrong if the argument passed is not a set. I've corrected the wiki entry.
If you have n sets (sets, not iterables), you'll do n-1 intersections and the time can be
(n-1)O(len(s)) where s is the set with the smallest size.
Note that as you do an intersection the result may get smaller, so although O is the worst case, in practice, the time will be better than this.
However, looking at the specific code this idea of taking the min() only applies to a single pair of sets and doesn't extend to multiple sets. So in this case, we have to be pessimistic and take s as the set with the largest size.

Related

Why do we only consider size of an input when estimating algorithm's complexity?

For the sake of the argument, consider following (very bad) sorting algorithm in python:
def so(ar):
while True:
le = len(ar)
switch = False
for y in range(le):
if y+1 == le:
break
if ar[y] > ar[y+1]:
ar[y],ar[y+1] = ar[y+1],ar[y]
switch = True
if switch == False:
break
return ar
I'm trying to understand the concept of "complexity of the algorithm" and there is one thing I don't get.
I came across the post that explains how to find the complexity of the algorithm here:
You add up how many machine instructions it will execute as a function
of the size of its input, and then simplify the expression to the
largest (when N is very large) term and can include any simplifying
constant factor.
But well, the problem is, that I cannot calculate how many machine instructions will be executed just
by knowing the length of the list.
Consider first example:
li = [random.randint(1,5000) for x in range(3000)]
start = time.time()
so(li)
end = time.time() - start
print(end)
Output: 2.96921706199646
Now have a look at the second example:
ok = [5000,43000,232] + [x for x in range(2997)]
start = time.time()
so(ok)
end = time.time() - start
print(end)
Output: 0.010689020156860352
We can see that the same sorting algorithm, two different lists, lists are the same length, and two completely different execution times.
When people are talking about algorithm complexity (big O notation) they normally assume that the only variable that determines complexity of the algo is the size of the input, but clearly, in the example above it is not the case. It is not only the size of the list, but also the positioning of each value within the list that determines the speed of the sorting.
So my question is, why do we only consider size of input when estimating complexity?
And, if it is possible, can you tell me what the complexity of the algorithm above will be?
You're correct, complexity doesn't only depend on N. That's why you'll often see indications about average, worst and best cases.
Timsort is used in Python because it's (O n log n) on average, still fast for worst-cases (O(n log n)) and extremely fast for best-cases (O(n), when the list is already sorted).
Quicksort also has an average complexity of O(n log n), but its worst case is O(n²), when the list is already sorted. This use case happens very often, so it might be worth it to actually shuffle the list before sorting it!
why do we only consider size of input when estimating complexity?
In the narrow sense of complexity as of the use of Big O notation in computer science, it is simply by definition:
In computer science, big O notation is used to classify algorithms according to how their running time or space requirements grow as the input size grows.
In the broader sense your question could be interpreted as "why do we use Big O notation to describe algorithm complexity when the nature of the data can be just as important as its size."
The answer here lies in the fact that algorithm development is often done on small datasets to make it easy, while in the real world the datasets are huge. When you are writing your sorting function you're most likely going to try it first on small lists of random data. You'd want the result small enough that you can verify that it worked by simply looking at the result...
The time complexity is not always definitely dependent on size of input. When we look at randomized sorting algorithms, the input patterns might play a significant role in determining time complexity.
We usually calculate time complexity in terms of worst, good and average case and could particularly study time complexity in terms of specific input order/patterns which could lead to good, average and best case time complexity.
For example, in first case provided by you, since input is randomized, there is 1/n! probability for a particular input to occur. The good case (when the list is sorted already) is Ω(n) and the worst case(when the list is reversely sorted) is O(n²) , but the probability is low for best or worst case to occur.
Therefore, the sorting algorithm has θ(n²) average time complexity since the probability of comparison and swap in case of two elements in average case input is high due to random distribution of numbers.
In the second case, the order is strict which means high probability for input to tend toward best case or worst case time complexity . In your case, input is more tending towards good case, therefore lesser time.

How is frozenset equality implemented in Python?

How is frozenset equality implemented in CPython? In particular, I want to know how individual elements in the fronzenset are compared with each other and their total time complexity.
I took a look at set and frozenset difference in implementation and https://stackoverflow.com/a/51778365/5792721.
The answer in the second link covers steps before comparing individual elements in the set, but missing the actual comparison algorithm.
The implementation in question can be found here. The algorithm goes something like this, given that we want to check if two sets v and w are equal:
Check if their sizes are equal. If they are not, return false.
This references the size attribute of the PySetObject, so it's a constant time operation.
Check if both sets are frozensets. If so, and their hashes are different, return false.
Again, this references the hash attribute of the PySetObject, which is a constant time operation. This is possible because frozensets are immutable objects and hence have their hashes calculated when they are created.
Check if w is a subset of v.
This is done by iterating through each element of v and checking if it is present in w.
The iteration must be performed over all of v, and is therefore at worst linear in its size (if a differing element is found in the last position).
Membership testing for sets in general takes constant time in the average case and linear time in the worst case; combining the two gives time linear in the size of v in the average case and time proportional to len(v) * len(w) in the worst case.
This is, in a sense, the average case of the worst case, since it assumes that the first two checks always return true. If your data only rarely tends to have equal-size sets which also have the same hashes but contain different objects, then in the average case the comparison will still run in constant time.
Both set and frozenset use the set_richcompare() function for equality tests. There's no difference here.
set_richcompare(), as the other answers mention, compares sizes and then hashes. Then, it checks whether one is a subset of the other, which just loops over every element in set 1, and checks if it is in set 2. This check is otherwise identical to checking whether an element is in a set. This is done by hash-table lookup, with a linear search over items with equal hashes. The complexity for the existence check is O(1) amortized or O(len(s)) worst-case, so the check for set_issubset() has a complexity of O(len(s1)) amortized, or O(len(s1)*len(s2)) worst-case.

Which data structure is appropriate for this?

I have a line in my code that currently does this at each step x:
myList = [(lo,hi) for lo,hi in myList if lo <= x <= hi]
This is pretty slow. Is there a more efficient way to eliminate things from a list that don't contain a given x?
Perhaps you're looking for an interval tree. From Wikipedia:
In computer science, an interval tree is an ordered tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point.
So, instead of storing the (lo, hi) pairs sequentially in a list, you would have them define the intervals in an interval tree. Then you could perform queries on the tree with x, and retain only the intervals that overlap x.
While you don't give much context, I'll assume the rest of the loop looks like:
for x in xlist:
myList = [(lo,hi) for lo,hi in myList if lo <= x <= hi]
In this case, if may be more efficient to construct an interval tree (http://en.wikipedia.org/wiki/Interval_tree) first. Then, for each x you walk the tree and find all intervals which intersect with x; add these intervals to a set as you find them.
Here I'm going to suggest what may seem like a really dumb solution favoring micro-optimizations over algorithmic ones. It'll depend on your specific needs.
The ultimate question is this: is a single linear pass over your array (list in Python), on average, expensive? In other words, is searching for lo/high pairs that contain x going to generally yield results that are very small (ex: 1% of the overall size of the list), or relatively quite large (ex: 25% or more of the original list)?
If the answer is the latter, you might actually get a more efficient solution keeping a basic, contiguous, cache-friendly representation that you're accessing sequentially. The hardware cache excels at plowing through contiguous data where multiple adjacent elements fit into a cache line sequentially.
What you want to avoid in such a case is the expensive linear-time removal from the middle of the array as well as possibly the construction of a new one. If you trigger a linear-time operation for every single individual element you remove from the array, then naturally that's going to get very expensive very quickly.
To exchange that linear-time operation for a much faster constant-time one, all we have to do when we want to remove an element at a certain index in the array is to overwrite the element at that index with the element at the back of the array (last element). Now simply remove the redundant duplicate from the back of the array (a removal from the back of an array is a constant-time operation, and often involves just basic arithmetical instructions).
If your needs fit the criteria, then this can actually give you better results than a smarter algorithm. It's one of the peculiar cases where the practice can trump the theory due to the skewed performance of the hardware cache over DRAM, but if you're performing these types of hi/lo queries repeatedly and wanting to get very narrow results, then something smarter like an interval tree or at least sorting the data to allow binary searches can be considerably better.

How to calculate the algorithmic complexity of Python functions? [duplicate]

This question already has answers here:
Python Time Complexity (run-time)
(6 answers)
Closed 2 years ago.
When required to show how efficient the algorithm is, we need to show the algorithmic complexity of functions - Big O and so on. In Python code, how can we show or calculate the bounds of functions?
In general, there's no way to do this programmatically (you run into the halting problem).
If you have no idea where to start, you can gain some insight into how a function will perform by running some benchmarks (e.g. using the time module) with inputs of various sizes. You can even collect enough data to form a suspicion about what the runtime might be. But this won't give you a rigorous answer - for that, you need to prove mathematically that your suspected bound is in fact true.
For instance, if I'm playing with a sorting function and observe that the time is increasing roughly proportionally to the square of the input size, I might suspect that the complexity of this sort is O(n**2). But this does not constitute proof - in particular, some algorithms that perform well under typical inputs have pathological inputs that result in very poor performance.
To prove that the bound is in fact O(n**2), I need to look at what the algorithm is doing in the worst case - in this example, I might be analysing a selection sort, which repeatedly sweeps across the entire unsorted portion of the list and picks the lowest unsorted number. It should be evident that I'm examining something like n*(n-1) == O(n**2) elements. If examining elements is a constant-time operation, and placing the final element in the correct place is also not worse than O(n**2), then it follows that my entire algorithm is O(n**2).
If you're trying to get the big O notation for your own functions, you probably need variables keeping track of things like:
the runTime; the number of comparisons; the number of iterations; etc. As well as some calculation investigating how these correspond to the size of your data.
It's probably best to do this manually first, so you can check your understanding of an algorithm.

What makes sets faster than lists?

The python wiki says: "Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple."
I've been using sets in place of lists whenever speed is important in my code, but lately I've been wondering why sets are so much faster than lists. Could anyone explain, or point me to a source that would explain, what exactly is going on behind the scenes in python to make sets faster?
list: Imagine you are looking for your socks in your closet, but you don't know in which drawer your socks are, so you have to search drawer by drawer until you find them (or maybe you never do). That's what we call O(n), because in the worst scenario, you will look in all your drawers (where n is the number of drawers).
set: Now, imagine you're still looking for your socks in your closet, but now you know in which drawer your socks are, say in the 3rd drawer. So, you will just search in the 3rd drawer, instead of searching in all drawers. That's what we call O(1), because in the worst scenario you will look in just one drawer.
Sets are implemented using hash tables. Whenever you add an object to a set, the position within the memory of the set object is determined using the hash of the object to be added. When testing for membership, all that needs to be done is basically to look if the object is at the position determined by its hash, so the speed of this operation does not depend on the size of the set. For lists, in contrast, the whole list needs to be searched, which will become slower as the list grows.
This is also the reason that sets do not preserve the order of the objects you add.
Note that sets aren't faster than lists in general -- membership test is faster for sets, and so is removing an element. As long as you don't need these operations, lists are often faster.
I think you need to take a good look at a book on data structures. Basically, Python lists are implemented as dynamic arrays and sets are implemented as a hash tables.
The implementation of these data structures gives them radically different characteristics. For instance, a hash table has a very fast lookup time but cannot preserve the order of insertion.
While I have not measured anything performance related in python so far, I'd still like to point out that lists are often faster.
Yes, you have O(1) vs. O(n). But always remember that this gives information only about the asymptotic behavior of something. That means if your n is very high O(1) will always be faster - theoretically. In practice however n often needs to be much bigger than your usual data set will be.
So sets are not faster than lists per se, but only if you have to handle a lot of elements.
Python uses hashtables, which have O(1) lookup.
Basically, Depends on the operation you are doing …
*For adding an element - then a set doesn’t need to move any data, and all it needs to do is calculate a hash value and add it to a table. For a list insertion then potentially there is data to be moved.
*For deleting an element - all a set needs to do is remove the hash entry from the hash table, for a list it potentially needs to move data around (on average 1/2 of the data.
*For a search (i.e. an in operator) - a set just needs to calculate the hash value of the data item, find that hash value in the hash table, and if it is there - then bingo. For a list, the search has to look up each item in turn - on average 1/2 of all of the terms in the list. Even for many 1000s of items a set will be far quicker to search.
Actually sets are not faster than lists in every scenario. Generally the lists are faster than sets. But in the case of searching for an element in a collection, sets are faster because sets have been implemented using hash tables. So basically Python does not have to search the full set, which means that the time complexity in average is O(1). Lists use dynamic arrays and Python needs to check the full array to search. So it takes O(n).
So finally we can see that sets are better in some case and lists are better in some cases. Its up to us to select the appropriate data structure according to our task.
A list must be searched one by one, where a set or dictionary has an index for faster searching.

Categories

Resources