So I just learned about sorting algorithm s bubble, merge, insertion, sort etc. they all seem to be very similar in their methods of sorting with what seems to me minimal changes in their approach. So why do they produce such different sorting times ie O(n^2) vs O(nlogn) as an example
The "similarity" (?!) that you see is completely illusory.
The elementary, O(N squared), approaches, repeat their workings over and over, without taking any advantage, for the "next step", of any work done on the "previous step". So the first step takes time proportional to N, the second one to N-1, and so on -- and the resulting sum of integers from 1 to N is proportional to N squared.
For example, in selection sort, you are looking each time for the smallest element in the I:N section, where I is at first 0, then 1, etc. This is (and must be) done by inspecting all those elements, because no care was previously taken to afford any lesser amount of work on subsequent passes by taking any advantage of previous ones. Once you've found that smallest element, you swap it with the I-th element, increment I, and continue. O(N squared) of course.
The advanced, O(N log N), approaches, are cleverly structured to take advantage in following steps of work done in previous steps. That difference, compared to the elementary approaches, is so pervasive and deep, that, if one cannot perceive it, that speaks chiefly about the acuity of one's perception, not about the approaches themselves:-).
For example, in merge sort, you logically split the array into two sections, 0 to half-length and half-length to length. Once each half is sorted (recursively by the same means, until the length gets short enough), the two halves are merged, which itself is a linear sub-step.
Since you're halving every time, you clearly need a number of steps proportional to log N, and, as each step is O(N), obviously you get the very desirable O(N log N) as a result.
Python's "timsort" is a "natural mergesort", i.e, a variant of mergesort tuned to take advantage of already-sorted (or reverse-sorted) parts of the array, which it recognizes rapidly and avoids spending any further work on. This doesn't change big-O because that's about worst-case time -- but expected time crashes much further down because in so many real-life cases some partial sortedness is present.
(Note that, going by the rigid definition of big-O, quicksort isn't quick at all -- it's worst-case proportional to N squared, when you just happen to pick a terrible pivot each and every time... expected-time wise it's fine, though nowhere as good as timsort, because in real life the situations where you repeatedly pick a disaster pivot are exceedingly rare... but, worst-case, they might happen!-).
timsort is so good as to blow away even very experienced programmers. I don't count because I'm a friend of the inventor, Tim Peters, and a Python fanatic, so my bias is obvious. But, consider...
...I remember a "tech talk" at Google where timsort was being presented. Sitting next to me in the front row was Josh Bloch, then also a Googler, and Java expert extraordinaire. Less than mid-way through the talk he couldn't resist any more - he opened his laptop and started hacking to see if it could possibly be as good as the excellent, sharp technical presentation seemed to show it would be.
As a result, timsort is now also the sorting algorithm in recent releases of the Java Virtual Machine (JVM), though only for user-defined objects (arrays of primitives are still sorted the old way, quickersort [*] I believe -- I don't know which Java peculiarities determined this "split" design choice, my Java-fu being rather weak:-).
[*] that's essentially quicksort plus some hacks for pivot choice to try and avoid the poison cases -- and it's also what Python used to use before Tim Peters gave this one immortal contribution out of the many important ones he's made over the decades.
The results are sometimes surprising to people with CS background (like Tim, I have the luck of having a far-ago academic background, not in CS, but in EE, which helps a lot:-). Say, for example, that you must maintain an ever-growing array that is always sorted at any point in time, as new incoming data points must get added to the array.
The classic approach would use bisection, O(log N), to find the proper insertion point for each new incoming data point -- but then, to put the new data in the right place, you need to shift what comes after it by one slot, that's O(N).
With timsort, you just append the new data point to the array, then sort the array -- that's O(N) for timsort in this case (as it's so awesome in exploiting the already-sorted nature of the first N-1 items!-).
You can think of timsort as pushing the "take advantage of work previously done" to a new extreme -- where not only work previously done by the algorithm itself, but also other influences by other aspects of real-life data processing (causing segments to be sorted in advance), are all exploited to the hilt.
Then we could move into bucket sort and radix sort, which change the plane of discourse -- which in traditional sorting limits one to being able to compare two items -- by exploiting the items' internal structure.
Or a similar example -- presented by Bentley in his immortal book "Programming Pearls" -- of needing to sort an array of several million unique positive integers, each constrained to be 24 bits long.
He solved it with an auxiliary array of 16M bits -- just 2M bytes after all -- initially all zeroes: one pass through the input array to set the corresponding bits in the auxiliary array, then one pass through the auxiliary array to form the required integers again where 1s are found -- and bang, O(N) [and very speedy:-)] sorting for this special but important case!-)
Related
For the sake of the argument, consider following (very bad) sorting algorithm in python:
def so(ar):
while True:
le = len(ar)
switch = False
for y in range(le):
if y+1 == le:
break
if ar[y] > ar[y+1]:
ar[y],ar[y+1] = ar[y+1],ar[y]
switch = True
if switch == False:
break
return ar
I'm trying to understand the concept of "complexity of the algorithm" and there is one thing I don't get.
I came across the post that explains how to find the complexity of the algorithm here:
You add up how many machine instructions it will execute as a function
of the size of its input, and then simplify the expression to the
largest (when N is very large) term and can include any simplifying
constant factor.
But well, the problem is, that I cannot calculate how many machine instructions will be executed just
by knowing the length of the list.
Consider first example:
li = [random.randint(1,5000) for x in range(3000)]
start = time.time()
so(li)
end = time.time() - start
print(end)
Output: 2.96921706199646
Now have a look at the second example:
ok = [5000,43000,232] + [x for x in range(2997)]
start = time.time()
so(ok)
end = time.time() - start
print(end)
Output: 0.010689020156860352
We can see that the same sorting algorithm, two different lists, lists are the same length, and two completely different execution times.
When people are talking about algorithm complexity (big O notation) they normally assume that the only variable that determines complexity of the algo is the size of the input, but clearly, in the example above it is not the case. It is not only the size of the list, but also the positioning of each value within the list that determines the speed of the sorting.
So my question is, why do we only consider size of input when estimating complexity?
And, if it is possible, can you tell me what the complexity of the algorithm above will be?
You're correct, complexity doesn't only depend on N. That's why you'll often see indications about average, worst and best cases.
Timsort is used in Python because it's (O n log n) on average, still fast for worst-cases (O(n log n)) and extremely fast for best-cases (O(n), when the list is already sorted).
Quicksort also has an average complexity of O(n log n), but its worst case is O(n²), when the list is already sorted. This use case happens very often, so it might be worth it to actually shuffle the list before sorting it!
why do we only consider size of input when estimating complexity?
In the narrow sense of complexity as of the use of Big O notation in computer science, it is simply by definition:
In computer science, big O notation is used to classify algorithms according to how their running time or space requirements grow as the input size grows.
In the broader sense your question could be interpreted as "why do we use Big O notation to describe algorithm complexity when the nature of the data can be just as important as its size."
The answer here lies in the fact that algorithm development is often done on small datasets to make it easy, while in the real world the datasets are huge. When you are writing your sorting function you're most likely going to try it first on small lists of random data. You'd want the result small enough that you can verify that it worked by simply looking at the result...
The time complexity is not always definitely dependent on size of input. When we look at randomized sorting algorithms, the input patterns might play a significant role in determining time complexity.
We usually calculate time complexity in terms of worst, good and average case and could particularly study time complexity in terms of specific input order/patterns which could lead to good, average and best case time complexity.
For example, in first case provided by you, since input is randomized, there is 1/n! probability for a particular input to occur. The good case (when the list is sorted already) is Ω(n) and the worst case(when the list is reversely sorted) is O(n²) , but the probability is low for best or worst case to occur.
Therefore, the sorting algorithm has θ(n²) average time complexity since the probability of comparison and swap in case of two elements in average case input is high due to random distribution of numbers.
In the second case, the order is strict which means high probability for input to tend toward best case or worst case time complexity . In your case, input is more tending towards good case, therefore lesser time.
What space complexity does the python sort take? I can't find any definitive documentation on this anywhere
Space complexity is defined as how much additional space the algorithm needs in terms of the N elements. And even though according to the docs, the sort method sorts a list in place, it does use some additional space, as stated in the description of the implementation:
timsort can require a temp array containing as many as N//2 pointers, which means as many as 2*N extra bytes on 32-bit boxes. It can be expected to require a temp array this large when sorting random data; on data with significant structure, it may get away without using any extra heap memory.
Therefore the worst case space complexity is O(N) and best case O(1)
Python's built in sort method is a spin off of merge sort called Timsort, more information here - https://en.wikipedia.org/wiki/Timsort.
It's essentially no better or worse than merge sort, which means that its run time on average is O(n log n) and its space complexity is Ω(n)
Each sample is an array of features (ints). I need to split my samples into two separate groups by figuring out what the best feature, and the best splitting value for that feature, is. By "best", I mean the split that gives me the greatest entropy difference between the pre-split set and the weighted average of the entropy values on the left and right sides. I need to try all (2^m−2)/2 possible ways to partition these items into two nonempty lists (where m is the number of distinct values (all samples with the same value for that feature are moved together as a group))
The following is extremely slow so I need a more reasonable/ faster way of doing this.
sorted_by_feature is a list of (feature_value, 0_or_1) tuples.
same_vals = {}
for ele in sorted_by_feature:
if ele[0] not in same_vals:
same_vals[ele[0]] = [ele]
else:
same_vals[ele[0]].append(ele)
l = same_vals.keys()
orderings = list(itertools.permutations(l))
for ordering in orderings:
list_tups = []
for dic_key in ordering:
list_tups += same_vals[dic_key]
left_1 = 0
left_0 = 0
right_1 = num_one
right_0 = num_zero
for index, tup in enumerate(list_tups):
#0's or #1's on the left +/- 1
calculate entropy on left/ right, calculate entropy drop, etc.
Trivial details (continuing the code above):
if index == len(sorted_by_feature) -1:
break
if tup[1] == 1:
left_1 += 1
right_1 -= 1
if tup[1] == 0:
left_0 += 1
right_0 -= 1
#only calculate entropy if values to left and right of split are different
if list_tups[index][0] != list_tups[index+1][0]:
tl;dr
You're asking for a miracle. No programming language can help you out of this one. Use better approaches than what you're considering doing!
Your Solution has Exponential Time Complexity
Let's assume a perfect algorithm: one that can give you a new partition in constant O(1) time. In other words, no matter what the input, a new partition can be generated in a guaranteed constant amount of time.
Let's in fact go one step further and assume that your algorithm is only CPU-bound and is operating under ideal conditions. Under ideal circumstances, a high-end CPU can process upwards of 100 billion instructions per second. Since this algorithm takes O(1) time, we'll say, oh, that every new partition is generated in a hundred billionth of a second. So far so good?
Now you want this to perform well. You say you want this to be able to handle an input of size m. You know that that means you need about pow(2,m) iterations of your algorithm - that's the number of partitions you need to generate, and since generating each algorithm takes a finite amount of time O(1), the total time is just pow(2,m) times O(1). Let's take a quick look at the numbers here:
m = 20 means your time taken is pow(2,20)*10^-11 seconds = 0.00001 seconds. Not bad.
m = 40 means your time taken is pow(2,40)10-11 seconds = 1 trillion/100 billion = 10 seconds. Also not bad, but note how small m = 40 is. In the vast panopticon of numbers, 40 is nothing. And remember we're assuming ideal conditions.
m = 100 means 10^41 seconds! What happened?
You're a victim of algorithmic theory. Simply put, a solution that has exponential time complexity - any solution that takes 2^m time to complete - cannot be sped up by better programming. Generating or producing pow(2,m) outputs is always going to take you the same proportion of time.
Note further that 100 billion instructions/sec is an ideal for high-end desktop computers - your CPU also has to worry about processes other than this program you're running, in which case kernel interrupts and context switches eat into processing time (especially when you're running a few thousand system processes, which you no doubt are). Your CPU also has to read and write from disk, which is I/O bound and takes a lot longer than you think. Interpreted languages like Python also eat into processing time since each line is dynamically converted to bytecode, forcing additional resources to be devoted to that. You can benchmark your code right now and I can pretty much guarantee your numbers will be way higher than the simplistic calculations I provide above. Even worse: storing 2^40 permutations requires 1000 GBs of memory. Do you have that much to spare? :)
Switching to a lower-level language, using generators, etc. is all a pointless affair: they're not the main bottleneck, which is simply the large and unreasonable time complexity of your brute force approach of generating all partitions.
What You Can Do Instead
Use a better algorithm. Generating pow(2,m) partitions and investigating all of them is an unrealistic ambition. You want, instead, to consider a dynamic programming approach. Instead of walking through the entire space of possible partitions, you want to only consider walking through a reduced space of optimal solutions only. That is what dynamic programming does for you. An example of it at work in a problem similar to this one: unique integer partitioning.
Dynamic programming problems approaches work best on problems that can be formulated as linearized directed acyclic graphs (Google it if not sure what I mean!).
If a dynamic approach is out, consider investing in parallel processing with a GPU instead. Your computer already has a GPU - it's what your system uses to render graphics - and GPUs are built to be able to perform large numbers of calculations in parallel. A parallel calculation is one in which different workers can do different parts of the same calculation at the same time - the net result can then be joined back together at the end. If you can figure out a way to break this into a series of parallel calculations - and I think there is good reason to suggest you can - there are good tools for GPU interfacing in Python.
Other Tips
Be very explicit on what you mean by best. If you can provide more information on what best means, we folks on Stack Overflow might be of more assistance and write such an algorithm for you.
Using a bare-metal compiled language might help reduce the amount of real time your solution takes in ordinary situations, but the difference in this case is going to be marginal. Compiled languages are useful when you have to do operations like searching through an array efficiently, since there's no instruction-compilation overhead at each iteration. They're not all that more useful when it comes to generating new partitions, because that's not something that removing the dynamic bytecode generation barrier actually affects.
A couple of minor improvements I can see:
Use try/catch instead of if not in to avoid double lookup of keys
if ele[0] not in same_vals:
same_vals[ele[0]] = [ele]
else:
same_vals[ele[0]].append(ele)
# Should be changed to
try:
same_vals[ele[0]].append(ele) # Most of the time this will work
catch KeyError:
same_vals[ele[0]] = [ele]
Dont explicitly convert a generator to a list if you dont have to. I dont immediately see any need for your casting to a list, which would slow things down
orderings = list(itertools.permutations(l))
for ordering in orderings:
# Should be changed to
for ordering in itertools.permutations(l):
But, like I said, these are only minor improvements. If you really needed this to be faster, consider using a different language.
(Python 2.7.8 Windows)
I'm doing a comparison between different sorting algorithms (Quick, bubble and insertion), and mostly it's going as expected, Quick sort is considerably faster with long lists and bubble and insertion are faster with very short lists and alredy sorted ones.
What's raising a problem is Quick Sort and the before mentioned "already sorted" lists. I can sort lists of even 100000 items without problems with this, but with lists of integers from 0...n the limit seems to be considerably lower. 0...500 works but even 0...1000 gives:
RuntimeError: maximum recursion depth exceeded in cmp
Quick Sort:
def quickSort(myList):
if myList == []:
return []
else:
pivot = myList[0]
lesser = quickSort([x for x in myList[1:] if x < pivot])
greater = quickSort([x for x in myList[1:] if x >= pivot])
myList = lesser + [pivot] + greater
return myList
Is there something wrong with the code, or am I missing something?
There are two things going on.
First, Python intentionally limits recursion to a fixed depth. Unlike, say, Scheme, which will keep allocating frames for recursive calls until you run out of memory, Python (at least the most popular implementation, CPython) will only allocate sys.getrecursionlimit() frames (defaulting to 1000) before failing. There are reasons for that,* but really, that isn't relevant here; just the fact that it does this is what you need to know about.
Second, as you may already know, while QuickSort is O(N log N) with most lists, it has a worst case of O(N^2)—in particular (using the standard pivot rules) with already-sorted lists. And when this happens, your stack depth can end up being O(N). So, if you have 1000 elements, arranged in worst-case order, and you're already one frame into the stack, you're going to overflow.
You can work around this in a few ways:
Rewrite the code to be iterative, with an explicit stack, so you're only limited by heap memory instead of stack depth.
Make sure to always recurse into the shorter side first, rather than the left side. This means that even in the O(N^2) case, your stack depth is still O(log N). But only if you've already done the previous step.**
Use a random, median-of-three, or other pivot rule that makes common cases not like already-sorted worst-case. (Of course someone can still intentionally DoS your code; there's really no way to avoid that with quicksort.) The Wikipedia article has some discussion on this, and links to the classic Sedgewick and Knuth papers.
Use a Python implementation with an unlimited stack.***
sys.setrecursionlimit(max(sys.getrecursionlimit(), len(myList)+CONSTANT)). This way, you'll fail right off the bat for an obvious reason if you can't make enough space, and usually won't fail otherwise. (But you might—you could be starting the sort already 900 steps deep in the stack…) But this is a bad idea.****. Besides, you have to figure out the right CONSTANT, which is impossible in general.*****
* Historically, the CPython interpreter recursively calls itself for recursive Python function calls. And the C stack is fixed in size; if you overrun the end, you could segfault, stomp all over heap memory, or all kinds of other problems. This could be changed—in fact, Stackless Python started off as basically just CPython with this change. But the core devs have intentionally chosen not to do so, in part because they don't want to encourage people to write deeply recursive code.
** Or if your language does automatic tail call elimination, but Python doesn't do that. But, as gnibbler points out, you can write a hybrid solution—recurse on the small end, then manually unwrap the tail recursion on the large end—that won't require an explicit stack.
*** Stackless and PyPy can both be configured this way.
**** For one thing, eventually you're going to crash the C stack.
***** The constant isn't really constant; it depends on how deep you already are in the stack (computable non-portably by walking sys._getframe() up to the top) and how much slack you need for comparison functions, etc. (not computable at all, you just have to guess).
You're choosing the first item of each sublist as the pivot. If the list is already in order, this means that your greater sublist is all the items but the first, rather than about half of them. Essentially, each recursive call manages to process only one item. Which means the depth of recursive calls you'll need to make will be about the same as the number of items in the full list. Which overflows Python's built-in limit once you hit about 1000 items. You will have a similar problem sorting lists that are already in reversed order.
To correct this use one of the workarounds suggested in the literature, such as choosing an item at random to be the pivot or the median of the first, middle, and last items.
Always choosing the first (or last) element as the pivot will have problems for quicksort - worst case performance for some common inputs as you have seen
One technique that works fairly well is to choose the average of first,middle and last element
You don't want to make the pivot selection too complicated, or it will dominate the runtime of the search
This question already has answers here:
Python Time Complexity (run-time)
(6 answers)
Closed 2 years ago.
When required to show how efficient the algorithm is, we need to show the algorithmic complexity of functions - Big O and so on. In Python code, how can we show or calculate the bounds of functions?
In general, there's no way to do this programmatically (you run into the halting problem).
If you have no idea where to start, you can gain some insight into how a function will perform by running some benchmarks (e.g. using the time module) with inputs of various sizes. You can even collect enough data to form a suspicion about what the runtime might be. But this won't give you a rigorous answer - for that, you need to prove mathematically that your suspected bound is in fact true.
For instance, if I'm playing with a sorting function and observe that the time is increasing roughly proportionally to the square of the input size, I might suspect that the complexity of this sort is O(n**2). But this does not constitute proof - in particular, some algorithms that perform well under typical inputs have pathological inputs that result in very poor performance.
To prove that the bound is in fact O(n**2), I need to look at what the algorithm is doing in the worst case - in this example, I might be analysing a selection sort, which repeatedly sweeps across the entire unsorted portion of the list and picks the lowest unsorted number. It should be evident that I'm examining something like n*(n-1) == O(n**2) elements. If examining elements is a constant-time operation, and placing the final element in the correct place is also not worse than O(n**2), then it follows that my entire algorithm is O(n**2).
If you're trying to get the big O notation for your own functions, you probably need variables keeping track of things like:
the runTime; the number of comparisons; the number of iterations; etc. As well as some calculation investigating how these correspond to the size of your data.
It's probably best to do this manually first, so you can check your understanding of an algorithm.