I was looking through some forums for new questions to solve, and found this one:
Given an array of scores and an integer k. Players with the same score have the same rank, and a player's rank is "the number of players with a higher score" + 1. For instance, given scores = [10, 20, 20, 40], the corresponding ranks are [4, 2, 2, 1]. Only players with a rank <= k qualify for the next round. Return the number of players that qualify for the next round.
I have come up with a few ways to solve it, and it seems the best time complexity I can get is O(nlog(n)) with the following algorithm:
sort the array, which has time complexity O(nlogn)
then, start with rank = 1 and update it each time we pass to a lower score; while rank <= k, keep adding in the players that qualify. This is O(n), since we may end up iterating through the whole array.
return the final count
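A minimal Python sketch of this sort-and-scan approach (function and variable names are my own):

```python
def count_qualifiers(scores, k):
    """Count players whose rank is <= k, where a player's rank is
    (number of players with a strictly higher score) + 1."""
    ordered = sorted(scores, reverse=True)     # O(n log n)
    count = 0
    rank = 1
    for i, s in enumerate(ordered):            # O(n) scan
        if i > 0 and s < ordered[i - 1]:
            rank = i + 1                       # rank jumps past all tied players
        if rank > k:
            break
        count += 1
    return count
```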
another idea was to create a hashtable that holds the score as the key and the number of players with that score as its value:
iterate through the array; each time we see a score, increment its count in the hashtable. If the score is larger than the smallest score currently in the hashtable (once it holds k entries), remove the smallest-score entry and put in the new score, so by the end we only have the top k scores.
then add together all the values in the resulting hashtable (or at least the relevant entries: the top k scores do not necessarily all correspond to qualifying ranks, but we know at most the top k scores are needed to find the number that qualify)
This seems to have time complexity O(nk), because we iterate through the whole array, and at each step check against the minimum of the current k scores to ensure we only keep the top k. For most values of k this takes longer than O(n log n).
However, I feel there must be an even better way than the methods I have come up with. Does anyone have any advice?
Here is the original forum post: https://leetcode.com/discuss/interview-question/1362837/goldman-sachs-new-analyst-2022-oa
Another idea is as follows:
Create a frequency table that counts the number of players for each score. This is similar to the hashtable idea you mentioned in your post. The keys are unique scores and values are the number of players for that particular score.
Using a min heap push the keys of the frequency table to the heap. As soon as the length of the heap becomes equal to k, for each new push to the heap, pop one from the heap. This guarantees that you end up with the k largest scores in the heap at the end.
Now, loop over the elements in the heap (without popping) which are keys to the freq table, and sum the number of players with those keys in the table.
Time complexity-wise we have run over the initial array in O(n) to create the freq table, we have pushed and popped the number of distinct scores from a heap and since the number of distinct scores is n in the worst case this makes it O(n * log k) operations. Notice that since the heap never grows over k it's log k and not log n. At the end we have looped over the k elements in the heap and summed their values from the freq table which is k operations.
So, this becomes n + (n * log k) + k which reduces to O(n * log k) in big O terms.
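A sketch of this frequency-table-plus-bounded-heap idea in Python (names are my own; heapq provides the min heap):

```python
import heapq
from collections import Counter

def count_qualifiers(scores, k):
    freq = Counter(scores)                     # O(n) frequency table
    heap = []                                  # min heap of at most k distinct scores
    for score in freq:                         # O(c log k) for c distinct scores
        heapq.heappush(heap, score)
        if len(heap) > k:
            heapq.heappop(heap)                # evict the smallest, keep the top k
    count = 0                                  # players counted so far (higher scores)
    for score in sorted(heap, reverse=True):   # at most k entries
        if count + 1 > k:                      # rank of this score group = count + 1
            break
        count += freq[score]
    return count
```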
This is a small variant on the selection problem: you're looking for the kth smallest element in a list, and the answer you need to output is the number of values less than or equal to the kth element's value. There are many possible solutions, but the standard one is Quickselect, which can give an answer in linear O(n) time. Let's look at a variety of increasingly efficient approaches for the standard selection problem and see their runtime:
Sort the numbers, and count the k smallest: Runtime: O(n log n).
Keep a min-heap, with size bounded at k. Iterate over the array, pushing each value into the heap, and popping whenever the size reaches k+1. Runtime: O(n log k)
Min-Heapify the entire array, and pop k times. Sometimes called 'heapselect'. Runtime: O(n + k log n)
Quickselect. With randomized pivot selection, it has O(n) expected run-time and O(n^2) worst-case runtime, with good average performance. With median-of-medians pivot selection, it has O(n) worst-case runtime, with a higher constant factor.
If you look at the C++ standard library, in the algorithms header, this selection function is called nth_element. In practice, variants of quickselect are often used, for example introselect or randomized quickselect with a heapselect fallback, which try to retain randomized quickselect's good average performance without an O(n^2) worst case.
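For illustration, here is a sketch of randomized quickselect (names are my own; it uses an allocating three-way partition for clarity rather than the in-place partition a production version would use):

```python
import random

def quickselect(arr, k):
    """Return the k-th smallest element (1-indexed) in expected O(n) time."""
    a = list(arr)                                 # avoid mutating the caller's list
    lo, hi = 0, len(a) - 1
    while True:
        pivot = a[random.randint(lo, hi)]
        # Allocating three-way partition of the active window, for clarity.
        window = a[lo:hi + 1]
        less = [x for x in window if x < pivot]
        equal = [x for x in window if x == pivot]
        greater = [x for x in window if x > pivot]
        a[lo:hi + 1] = less + equal + greater
        if k - 1 < lo + len(less):                # target lies among the smaller values
            hi = lo + len(less) - 1
        elif k - 1 < lo + len(less) + len(equal): # target equals the pivot
            return pivot
        else:                                     # target lies among the larger values
            lo = lo + len(less) + len(equal)
```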
For example:
T = [b, c, b, b, a, c]  # each element represents a question on topic a, b, or c
What I want is to make sure that no two questions from the same topic are next to one another (see T, where two b's are next to each other).
So I want to rearrange T in such a way that no two questions belonging to the same topic are next to each other, i.e. Tnew = [b, c, b, a, b, c]
But the condition is that we have to do it in linear time, i.e. O(n), or Big-O of (n)
The algorithm that I thought of:
1) Create a dict or map to hold the occurrence count of each topic:
a --> 1
b --> 3
c --> 2
2) Now based on the counts we can create new array such that:
A = [a, b, b, b, c, c]
3) Now perform an "unsort" of the array, which I believe runs in O(n).
(Unsorting here means: find the midpoint, then merge elements alternately from each half.)
Can someone please help me design pseudocode or an algorithm that can do this better on any input with k topics?
This is a random question that I am practicing for an exam.
There is an approach with time complexity O(n log c), where c is the number of unique items and n is the total number of items.
It works as follows:
Create a frequency table of the items. This is just a list of tuples that tells how many of each item are available. Let's store the tuples in the list as (quantity, item). This step is O(n). You could use collections.Counter in Python, a collections.defaultdict(int), or a vanilla dict.
Heapify the list of tuples into a max heap. This can be done in O(n). This heap has the items with the largest quantity at the front. You could use the heapq module in Python (negating the quantities, since heapq is a min heap). Let's call this heap hp.
Have a list for the results called res.
Now run a loop while len(hp) > 0: and do as follows:
pop the 2 largest elements from the heap. O(log c) operation.
add one from each to res. Make sure you handle edge cases properly, if any.
decrement the quantity of both items. If their quantity > 0 push them back on the heap. O(log c) operation.
At the end, you could be left with one item that has no peers to interleave with. This happens when the quantity of one item is larger than the sum of the quantities of all the other items; if it exceeds that sum by more than one, there is no way around it and no valid arrangement exists. Your input data must respect this condition.
One final note about time complexity: If the number of unique items is constant, we could drop the log c factor from the time complexity and consider it linear. This is mainly a case of how we define things.
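A sketch of the loop described above in Python, using heapq with negated quantities to simulate a max heap (names are my own):

```python
import heapq
from collections import Counter

def rearrange(items):
    """Interleave items so that no two equal items end up adjacent."""
    # Max heap simulated with negated quantities: entries are (-quantity, item).
    hp = [(-q, x) for x, q in Counter(items).items()]
    heapq.heapify(hp)                      # O(c) for c distinct items
    res = []
    while len(hp) >= 2:
        q1, x1 = heapq.heappop(hp)         # two most plentiful items, O(log c) each
        q2, x2 = heapq.heappop(hp)
        res += [x1, x2]                    # append one of each
        if q1 + 1 < 0:                     # quantities are negative: +1 consumes one
            heapq.heappush(hp, (q1 + 1, x1))
        if q2 + 1 < 0:
            heapq.heappush(hp, (q2 + 1, x2))
    if hp:                                 # one item left without a peer
        q, x = hp[0]
        if q < -1:                         # more than one copy left: impossible input
            raise ValueError("no valid arrangement exists")
        res.append(x)
    return res
```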
Here's the O(n) solution (inspired in part by #user1984's answer):
Imagine you know how many of each element to insert, and have ordered these counts. Say we then decided to build up a solution by interleaving groups of elements incrementally. We start off with just our group of elements G0 with lowest frequency. Then, we take the next most popular group G1, and interleave these values into our existing list.
If we were to continue in this fashion, we could observe a few rules:
if the current group G has more elements than all other smaller groups combined plus one, then:
the result will have elements of G neighboring each other
regardless of its prior state, no elements from the smaller groups will neighbor each other (neither inter-group nor intra-group)
otherwise
the result will have no elements of G neighboring each other
regardless, G contains enough elements to separate individual elements of the next smallest group G-1, if positioned wisely.
With this in mind, we can see (recursively) that as long as we shift our interleaving to overlap any outstanding neighbor violations, we can guarantee that as long as G has fewer elements than the smaller groups combined plus two, the overall result absolutely will not have any neighbor violations.
Of course, the logic I outlined above for computing this poses some performance issues, so we're going to compute the same result in a slightly different, but equivalent, way. We'll instead insert the largest group first, and work our way smaller.
First, create a frequency table. You need to form an item: quantity mapping, so something like collections.Counter() in Python. This takes O(N) time.
Next, order this mapping by frequency. This can be done in O(N) time using counting sort, since all elements are integers. Note there's c elements, but c <= N, so O(c) is O(N).
After that, build a linked list of length N, with node values from [0, N) (ascending). This will help us track which indices to write into next.
For each item in our ordered mapping, iterate from 0 to the associated count (exclusive). Each iteration, remove the current element from the linked list ((re)starting at the head), and traverse two nodes forward in the linked list. Insert the item/number into the destination array at the index of each removed node. This will take O(N) time, since we traverse ~2k nodes per item (where k is the group size), and the combined size of all groups is N, so 2N traversals. Each removal can be performed in O(1) time, so O(N) for traversing and removing.
So all in all, this will take O(N) time, utilizing linked lists, hash tables (or O(1) access mappings of some sort), and counting sort.
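The traverse-two-forward trick amounts to writing into every other index, so the same idea can be sketched more simply by filling even indices first and then odd ones, largest group first (a sketch under that assumption; names are my own):

```python
from collections import Counter

def rearrange(items):
    """O(n) rearrangement: place items into even indices first, then odd,
    taking groups in decreasing order of frequency."""
    n = len(items)
    counts = Counter(items)                      # O(n)
    # Counting sort of groups by frequency: buckets[f] holds items occurring f times.
    buckets = [[] for _ in range(n + 1)]
    for item, f in counts.items():
        buckets[f].append(item)
    res = [None] * n
    idx = 0
    for f in range(n, 0, -1):                    # largest groups first
        for item in buckets[f]:
            for _ in range(f):
                res[idx] = item
                idx += 2                         # every other slot
                if idx >= n:
                    idx = 1                      # wrap around to the odd slots
    # Sanity check: fails only if some item occurs more than ceil(n/2) times.
    if any(res[i] == res[i + 1] for i in range(n - 1)):
        raise ValueError("no valid arrangement exists")
    return res
```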
The collections module can help. Using Counter gets you the number of occurrences of each question in O(n) time. Converting those into iterators in a deque will allow you to interleave the questions sequentially, but you need to process them in decreasing order of occurrences. Getting the counters in order of frequency would normally require a sort, which is O(n log n), but you can use a pigeonhole approach to group the iterators by common frequency in O(n) time, then go through the groups in reverse order of frequency. The maximum number of distinct frequencies is knowable and will be less than or equal to √(0.25+2n) − 0.5, which is O(√n).
There will be at most n iterators, so building the deque will be O(n). Going through the iterators until exhaustion will take at most 2n iterations:
T = ['b', 'c', 'b', 'b', 'a', 'c']
from collections import Counter, deque
from itertools import repeat

result = []
counts = Counter(T)                                    # count occurrences O(n)
common = [[] for _ in range(max(counts.values()))]     # frequency groups
for t, n in counts.items():
    common[n-1].append(repeat(t, n))                   # iterators grouped by frequency
Q = deque(iq for cq in reversed(common) for iq in cq)  # queue iterators, most frequent first
while Q:                                               # 2n iterations or less
    iq = Q.popleft()                                   # O(1) - extract first iterator
    q = next(iq, None)                                 # O(1) - next question
    if q is None: continue                             # exhausted, stays removed from deque
    result.append(q)                                   # O(1) - add question to result
    Q.insert(1, iq)                                    # put back iterator as 2nd
print(result)
['b', 'c', 'b', 'c', 'b', 'a']
I am trying to solve this problem on Hackerrank: https://www.hackerrank.com/challenges/climbing-the-leaderboard. The problem statement basically states that there are two sets of scores, one of the other players and one of Alice, and we have to use dense ranking and display Alice's rank when compared to the other players' scores. It is giving me a Time-Out error on large test cases. I have already used the forum suggestions on Hackerrank and was successful, but I am specifically curious to know the problem in my code. Here is my code:
class Dict(dict):
    def __init__(self):
        self = dict()
    def add(self, key, value):
        self[key] = value

def climbingLeaderboard(scores, alice):
    alice_rank = []
    for i in range(len(alice)):
        scores.append(alice[i])
        a = list(set(scores))
        a.sort(reverse=True)
        obj = Dict()
        b = 1
        for j in a:
            obj.add(j, b)
            b += 1
        if alice[i] in obj:
            alice_rank.append(obj[alice[i]])
        scores.remove(alice[i])
    return alice_rank
You have a couple of problems in your code but the most important one is the following.
...
scores.append(alice[i])
a=list(set(scores))
a.sort(reverse=True)
...
On each iteration you add Alice's score to scores and then sort scores. The cost here is already O(n log(n)), where n is the number of elements in scores. Thus, your total time complexity becomes O(n*n*log(n)). That's too much, because n can reach 200000, and so your solution can take up to 200000*200000*log(200000) operations.
Of course, there's another problem:
...
for j in a:
obj.add(j,b)
b+=1
...
But it's still not as bad as the previous one since the loop time complexity is O(n).
There exists an O(n*log(n)) time complexity solution. I'll give you the overall idea so that you can easily implement it yourself.
If you recall that players with duplicate scores share the same position in the leaderboard, then you can convert your scores to an array without duplicates (e.g. sorted(set(scores), reverse=True)) before your loop. In that case, the first position corresponds to the highest score, the second one to the second highest score, and so on (the initial array is sorted in decreasing order per the problem statement).
Given the step above, for each score of Alice you can find the position in the array at which the player's score is less than or equal to her score. The lookup takes O(log(n)) because the array is sorted. For instance, if the players' scores are 40, 30, 10 and Alice's score is 35, then the found position will be 2 (for the algorithm description I consider that indexing starts from 1), as 30 occupies this position. This position is the ACTUAL position of Alice in the leaderboard and so can be printed right away.
Another tip - you can use bisect module for performing a binary search in the array.
So, the overall time complexity of the proposed solution is O(n*log(n)). It will pass all the test cases (I've tried it).
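A sketch of the proposed solution using the bisect module (function name is my own):

```python
from bisect import bisect_right

def climbing_leaderboard(ranked, player):
    """Dense rank of each of the player's scores against the leaderboard."""
    unique = sorted(set(ranked))                 # ascending, built once: O(n log n)
    result = []
    for score in player:                         # each lookup is O(log n)
        higher = len(unique) - bisect_right(unique, score)
        result.append(higher + 1)                # rank = strictly higher scores + 1
    return result
```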
Performing a repeated sort (a.sort(reverse=True)) consumes a lot of time. I had the same problem. If you read the question, you will find that the scores are given in sorted order. The trick is to exploit this inherent ordering of the input.
One more thing: your code's time complexity is O(n^2) due to the nested loop, whereas the forum solution you spoke of may be doing it in O(n) (not sure).
Can someone help explain how can building a heap be O(n) complexity?
Inserting an item into a heap is O(log n), and the insert is repeated n/2 times (the remainder are leaves, and can't violate the heap property). So, this means the complexity should be O(n log n), I would think.
In other words, for each item we "heapify", it has the potential to have to filter down (i.e., sift down) once for each level for the heap so far (which is log n levels).
What am I missing?
I think there are several questions buried in this topic:
How do you implement buildHeap so it runs in O(n) time?
How do you show that buildHeap runs in O(n) time when implemented correctly?
Why doesn't that same logic work to make heap sort run in O(n) time rather than O(n log n)?
How do you implement buildHeap so it runs in O(n) time?
Often, answers to these questions focus on the difference between siftUp and siftDown. Making the correct choice between siftUp and siftDown is critical to get O(n) performance for buildHeap, but does nothing to help one understand the difference between buildHeap and heapSort in general. Indeed, proper implementations of both buildHeap and heapSort will only use siftDown. The siftUp operation is only needed to perform inserts into an existing heap, so it would be used to implement a priority queue using a binary heap, for example.
I've written this to describe how a max heap works. This is the type of heap typically used for heap sort or for a priority queue where higher values indicate higher priority. A min heap is also useful; for example, when retrieving items with integer keys in ascending order or strings in alphabetical order. The principles are exactly the same; simply switch the sort order.
The heap property specifies that each node in a binary heap must be at least as large as both of its children. In particular, this implies that the largest item in the heap is at the root. Sifting down and sifting up are essentially the same operation in opposite directions: move an offending node until it satisfies the heap property:
siftDown swaps a node that is too small with its largest child (thereby moving it down) until it is at least as large as both nodes below it.
siftUp swaps a node that is too large with its parent (thereby moving it up) until it is no larger than the node above it.
The number of operations required for siftDown and siftUp is proportional to the distance the node may have to move. For siftDown, it is the distance to the bottom of the tree, so siftDown is expensive for nodes at the top of the tree. With siftUp, the work is proportional to the distance to the top of the tree, so siftUp is expensive for nodes at the bottom of the tree. Although both operations are O(log n) in the worst case, in a heap, only one node is at the top whereas half the nodes lie in the bottom layer. So it shouldn't be too surprising that if we have to apply an operation to every node, we would prefer siftDown over siftUp.
The buildHeap function takes an array of unsorted items and moves them until they all satisfy the heap property, thereby producing a valid heap. There are two approaches one might take for buildHeap using the siftUp and siftDown operations we've described.
Start at the top of the heap (the beginning of the array) and call siftUp on each item. At each step, the previously sifted items (the items before the current item in the array) form a valid heap, and sifting the next item up places it into a valid position in the heap. After sifting up each node, all items satisfy the heap property.
Or, go in the opposite direction: start at the end of the array and move backwards towards the front. At each iteration, you sift an item down until it is in the correct location.
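The second (siftDown) approach might be sketched in Python like this (names are my own; zero-based array indexing with children at 2i+1 and 2i+2):

```python
def sift_down(a, i, n):
    """Move a[i] down until it is at least as large as its children (max heap)."""
    while True:
        largest = i
        left, right = 2 * i + 1, 2 * i + 2
        if left < n and a[left] > a[largest]:
            largest = left
        if right < n and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]      # swap with the larger child
        i = largest

def build_heap(a):
    """Bottom-up O(n) heap construction: siftDown every internal node."""
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):          # leaves (indices >= n//2) are skipped
        sift_down(a, i, n)
```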
Which implementation for buildHeap is more efficient?
Both of these solutions will produce a valid heap. Unsurprisingly, the more efficient one is the second operation that uses siftDown.
Let h = log n represent the height of the heap. The work required for the siftDown approach is given by the sum
(0 * n/2) + (1 * n/4) + (2 * n/8) + ... + (h * 1).
Each term in the sum has the maximum distance a node at the given height will have to move (zero for the bottom layer, h for the root) multiplied by the number of nodes at that height. In contrast, the sum for calling siftUp on each node is
(h * n/2) + ((h-1) * n/4) + ((h-2)*n/8) + ... + (0 * 1).
It should be clear that the second sum is larger. The first term alone is hn/2 = 1/2 n log n, so this approach has complexity at best O(n log n).
How do we prove the sum for the siftDown approach is indeed O(n)?
One method (there are other analyses that also work) is to turn the finite sum into an infinite series and then use Taylor series. We may ignore the first term, which is zero:

sum from i=1 to h of (i * n/2^(i+1)) < n * sum from i=1 to infinity of (i/2^(i+1)) = (n/2) * sum from i=1 to infinity of (i * (1/2)^i) = (n/2) * 2 = n
If you aren't sure why each of those steps works, here is a justification for the process in words:
The terms are all positive, so the finite sum must be smaller than the infinite sum.
The series is equal to a power series evaluated at x=1/2.
That power series is equal to (a constant times) the derivative of the Taylor series for f(x)=1/(1-x).
x=1/2 is within the interval of convergence of that Taylor series.
Therefore, we can replace the Taylor series with 1/(1-x), differentiate, and evaluate to find the value of the infinite series.
Since the infinite sum is exactly n, we conclude that the finite sum is no larger, and is therefore, O(n).
Why does heap sort require O(n log n) time?
If it is possible to run buildHeap in linear time, why does heap sort require O(n log n) time? Well, heap sort consists of two stages. First, we call buildHeap on the array, which requires O(n) time if implemented optimally. The next stage is to repeatedly delete the largest item in the heap and put it at the end of the array. Because we delete an item from the heap, there is always an open spot just after the end of the heap where we can store the item. So heap sort achieves a sorted order by successively removing the next largest item and putting it into the array starting at the last position and moving towards the front. It is the complexity of this last part that dominates in heap sort. The loop looks like this:
for (i = n - 1; i > 0; i--) {
arr[i] = deleteMax();
}
Clearly, the loop runs O(n) times (n - 1 to be precise, the last item is already in place). The complexity of deleteMax for a heap is O(log n). It is typically implemented by removing the root (the largest item left in the heap) and replacing it with the last item in the heap, which is a leaf, and therefore one of the smallest items. This new root will almost certainly violate the heap property, so you have to call siftDown until you move it back into an acceptable position. This also has the effect of moving the next largest item up to the root. Notice that, in contrast to buildHeap where for most of the nodes we are calling siftDown from the bottom of the tree, we are now calling siftDown from the top of the tree on each iteration! Although the tree is shrinking, it doesn't shrink fast enough: The height of the tree stays constant until you have removed the first half of the nodes (when you clear out the bottom layer completely). Then for the next quarter, the height is h - 1. So the total work for this second stage is
h*n/2 + (h-1)*n/4 + ... + 0 * 1.
Notice the switch: now the zero work case corresponds to a single node and the h work case corresponds to half the nodes. This sum is O(n log n) just like the inefficient version of buildHeap that is implemented using siftUp. But in this case, we have no choice since we are trying to sort and we require the next largest item be removed next.
In summary, the work for heap sort is the sum of the two stages: O(n) time for buildHeap and O(n log n) to remove each node in order, so the complexity is O(n log n). You can prove (using some ideas from information theory) that for a comparison-based sort, O(n log n) is the best you could hope for anyway, so there's no reason to be disappointed by this or expect heap sort to achieve the O(n) time bound that buildHeap does.
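Putting the two stages together, a heap sort sketch (my own naming) looks like:

```python
def heap_sort(a):
    """In-place heap sort: O(n) buildHeap, then n-1 deleteMax steps, O(log n) each."""
    def sift_down(i, n):
        while True:
            largest, left, right = i, 2 * i + 1, 2 * i + 2
            if left < n and a[left] > a[largest]:
                largest = left
            if right < n and a[right] > a[largest]:
                largest = right
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]
            i = largest

    n = len(a)
    for i in range(n // 2 - 1, -1, -1):      # stage 1: buildHeap in O(n)
        sift_down(i, n)
    for end in range(n - 1, 0, -1):          # stage 2: repeatedly deleteMax
        a[0], a[end] = a[end], a[0]          # move the max into the sorted suffix
        sift_down(0, end)                    # restore the heap on the prefix
```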
Your analysis is correct. However, it is not tight.
It is not really easy to explain why building a heap is a linear operation, you should better read it.
A great analysis of the algorithm can be seen here.
The main idea is that in the build_heap algorithm the actual heapify cost is not O(log n) for all elements.
When heapify is called, the running time depends on how far an element might move down in the tree before the process terminates. In other words, it depends on the height of the element in the heap. In the worst case, the element might go down all the way to the leaf level.
Let us count the work done level by level.
At the bottommost level there are 2^h nodes, but we do not call heapify on any of these, so the work is 0. At the next level there are 2^(h − 1) nodes, and each might move down by 1 level. At the 3rd level from the bottom there are 2^(h − 2) nodes, and each might move down by 2 levels.
As you can see not all heapify operations are O(log n), this is why you are getting O(n).
Intuitively:
"The complexity should be O(nLog n)... for each item we "heapify", it has the potential to have to filter down once for each level for the heap so far (which is log n levels)."
Not quite. Your logic does not produce a tight bound -- it overestimates the complexity of each heapify. If built from the bottom up, insertion (heapify) can be much less than O(log(n)). The process is as follows:
( Step 1 ) The first n/2 elements go on the bottom row of the heap. h=0, so heapify is not needed.
( Step 2 ) The next n/2^2 elements go on the row 1 up from the bottom. h=1, heapify filters 1 level down.
( Step i ) The next n/2^i elements go in row i up from the bottom. h=i, heapify filters i levels down.
( Step log(n) ) The last n/2^(log2(n)) = 1 element goes in row log(n) up from the bottom. h=log(n), heapify filters log(n) levels down.
NOTICE: that after step one, 1/2 of the elements (n/2) are already in the heap, and we didn't even need to call heapify once. Also, notice that only a single element, the root, actually incurs the full log(n) complexity.
Theoretically:
The Total steps N to build a heap of size n, can be written out mathematically.
At height i, we've shown (above) that there will be n/2^(i+1) elements that need to call heapify, and we know heapify at height i is O(i). This gives:

N = sum from i=0 to log(n) of (n/2^(i+1)) * O(i) = O( n * sum from i=0 to log(n) of i/2^i )

The solution to the last summation can be found by taking the derivative of both sides of the well known geometric series equation:

sum over i of x^i = 1/(1-x)   =>   sum over i of i*x^(i-1) = 1/(1-x)^2   =>   sum over i of i*x^i = x/(1-x)^2

Finally, plugging x = 1/2 into the above equation yields sum over i of i/2^i = 2. Plugging this into the first equation gives:

N = O(2n) = O(n)
Thus, the total number of steps is of size O(n)
There are already some great answers but I would like to add a little visual explanation
Now, take a look at the image: there are
n/2^1 green nodes with height 0 (here ceil(23/2) = 12)
n/2^2 red nodes with height 1 (here ceil(23/4) = 6)
n/2^3 blue nodes with height 2 (here ceil(23/8) = 3)
n/2^4 purple nodes with height 3 (here ceil(23/16) = 2)
so there are n/2^(h+1) nodes for height h
To find the time complexity, let's count the amount of work done, or the maximum number of iterations performed, by each node.
Now it can be noticed that each node can perform (at most) a number of iterations equal to the height of the node.
Green = n/2^1 * 0 (no iterations, since no children)
Red = n/2^2 * 1 (heapify will perform at most one swap for each red node)
Blue = n/2^3 * 2 (heapify will perform at most two swaps for each blue node)
Purple = n/2^4 * 3 (heapify will perform at most three swaps for each purple node)
so for any nodes with height h maximum work done is n/2^(h+1) * h
Now total work done is
->(n/2^1 * 0) + (n/2^2 * 1)+ (n/2^3 * 2) + (n/2^4 * 3) +...+ (n/2^(h+1) * h)
-> n * ( 0 + 1/4 + 2/8 + 3/16 +...+ h/2^(h+1) )
now for any value of h, the sequence
-> ( 0 + 1/4 + 2/8 + 3/16 +...+ h/2^(h+1) )
will never exceed 1
Thus the time complexity will never exceed O(n) for building heap
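As a quick numeric sanity check (my own addition, not part of the answer), the bracketed series indeed stays below 1:

```python
# Partial sums of 0/2 + 1/4 + 2/8 + ... + h/2^(h+1): they approach 1 from below.
def partial(h):
    return sum(i / 2 ** (i + 1) for i in range(h + 1))

for h in (1, 5, 10, 40):
    print(h, partial(h))
```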
It would be O(n log n) if you built the heap by repeatedly inserting elements. However, you can create a new heap more efficiently by inserting the elements in arbitrary order and then applying an algorithm to "heapify" them into the proper order (depending on the type of heap of course).
See http://en.wikipedia.org/wiki/Binary_heap, "Building a heap" for an example. In this case you essentially work up from the bottom level of the tree, swapping parent and child nodes until the heap conditions are satisfied.
As we know, the height of a heap is log(n), where n is the total number of elements. Let's represent it as h.
When we perform the heapify operation, the elements at the last level (h) won't move even a single step.
The number of elements at the second-to-last level (h−1) is 2^(h−1), and they can move at most 1 level down (during heapify). Similarly, at level i we have 2^i elements, which can move (h−i) levels down.
Therefore the total number of moves is:
S = 2^h * 0 + 2^(h−1) * 1 + 2^(h−2) * 2 + ... + 2^0 * h
S = 2^h (1/2 + 2/2^2 + 3/2^3 + ... + h/2^h) -------------------------------------------------1
This is an AGP (arithmetico-geometric) series; to solve it, divide both sides by 2:
S/2 = 2^h (1/2^2 + 2/2^3 + ... + h/2^(h+1)) -------------------------------------------------2
Subtracting equation 2 from 1 gives
S/2 = 2^h (1/2 + 1/2^2 + 1/2^3 + ... + 1/2^h − h/2^(h+1))
S = 2^(h+1) (1/2 + 1/2^2 + 1/2^3 + ... + 1/2^h − h/2^(h+1))
Now 1/2 + 1/2^2 + 1/2^3 + ... + 1/2^h is a decreasing GP whose sum is less than 1 (when h tends to infinity, the sum tends to 1). In further analysis, let's take 1 as an upper bound on that sum.
This gives:
S ≤ 2^(h+1) (1 − h/2^(h+1)) = 2^(h+1) − h ~ 2^(h+1)
As h = log(n), 2^h = n.
Therefore S ≤ 2n − log(n), and T(n) = O(n)
While building a heap, lets say you're taking a bottom up approach.
You take each element and compare it with its children to check if the pair conforms to the heap rules. The leaves therefore get included in the heap for free, because they have no children.
Moving upwards, the worst case scenario for the node right above the leaves would be 1 comparison (At max they would be compared with just one generation of children)
Moving further up, their immediate parents can at max be compared with two generations of children.
Continuing in the same direction, you'll have log(n) comparisons for the root in the worst case scenario. and log(n)-1 for its immediate children, log(n)-2 for their immediate children and so on.
So summing it all up, you arrive on something like log(n) + {log(n)-1}*2 + {log(n)-2}*4 + ..... + 1*2^{(logn)-1} which is nothing but O(n).
We get the runtime for the heap build by figuring out the maximum number of moves each node can make.
So we need to know how many nodes are in each row and how far each node can move from its row.
Starting from the root node, each row has double the nodes of the previous row, so by asking how often we can double the number of nodes until none are left, we get the height of the tree.
Or, in mathematical terms, the height of the tree is log2(n), n being the length of the array.
To count the nodes in a row we start from the back: we know n/2 nodes are at the bottom, so by dividing by 2 we get the previous row, and so on.
Based on this we get this formula for the Siftdown approach:
(0 * n/2) + (1 * n/4) + (2 * n/8) + ... + (log2(n) * 1)
The term in the last parentheses is the height of the tree multiplied by the one node that is at the root; the terms in the first parentheses are all the nodes in the bottom row multiplied by the distance they can travel, 0.
The same formula in closed form: the total is n * (sum over i of i/2^(i+1)), and since the infinite sum of i/2^i is 2, this is at most 2 * n.
Bringing the n back in, we have 2 * n; the 2 can be discarded because it is a constant, and tada, we have the worst-case runtime of the siftDown approach: n.
Short Answer
Building a binary heap will take O(n) time with Heapify().
When we add the elements to a heap one by one, keeping the heap property (max heap or min heap) satisfied at every step, the total time complexity is O(n log n).
This is because the general structure of a binary heap is that of a complete binary tree, hence the height of the heap is h = O(log n). So the insertion time of an element in the heap is equivalent to the height of the tree, i.e. O(h) = O(log n). For n elements this takes O(n log n) time.
Consider another approach now.
I assume that we have a min heap for simplicity.
So every node should be smaller than its children.
Add all the elements in the skeleton of a complete binary tree. This will take O(n) time.
Now we just have to somehow satisfy the min-heap property.
Since all the leaf elements have no children, they already satisfy the heap property. The total number of leaf elements is ceil(n/2), where n is the total number of elements present in the tree.
Now for every internal node, if it is greater than its children, swap it with the minimum child in a bottom to top way. It will take O(1) time for every internal node. Note: We will not swap the values up to the root like we do in insertion. We just swap it once so that subtree rooted at that node is a proper min heap.
In the array-based implementation of a binary heap, we have parent(i) = floor((i-1)/2), and the children of i are given by 2*i + 1 and 2*i + 2. So by observation we can say that the last ceil(n/2) elements in the array are leaf nodes; the greater the depth, the greater the index of a node. We will repeat Step 4 for array[n/2 - 1], array[n/2 - 2], ..., array[0]. In this way we ensure that we work bottom to top, and overall we eventually establish the min-heap property.
Step 4 for all the n/2 elements will take O(n) time.
So our total time complexity of heapify using this approach will be O(n) + O(n) ~ O(n).
In the case of building the heap, we start from height log n − 1 (where log n is the height of a tree of n elements).
An element at height h can sift down at most h levels, and there are at most 2^(log n − h) nodes at height h.
So the total number of traversal steps is:
T(n) = Σ (2^(log n − h) · h), where h varies from 1 to log n
T(n) = n · ((1/2) + (2/4) + (3/8) + ... + (log n / 2^(log n)))
T(n) = n · Σ (x / 2^x), where x varies from 1 to log n
The series in the brackets converges to 2 as the upper limit goes to infinity.
Hence T(n) ~ O(n).
Successive insertions can be described by:
T = O(log(1) + log(2) + .. + log(n)) = O(log(n!))
By Stirling's approximation, n! ≈ n^(n + O(1)), so log(n!) = Θ(n log n) and therefore T = O(n log n).
Hope this helps. The optimal O(n) way is the build-heap algorithm applied to a given set (the input ordering doesn't matter).
Basically, while building a heap, work is done only on the non-leaf nodes, and the work done at each is the amount of swapping down needed to satisfy the heap condition; in other words (in the worst case), it is proportional to the height of the node. All in all, the complexity of the problem is proportional to the sum of the heights of all the non-leaf nodes, which is (2^(h+1) − 1) − h − 1 = n − h − 1 = O(n).
#bcorso has already demonstrated the proof of the complexity analysis. But for the sake of those still learning complexity analysis, I have this to add:
The basis of your original mistake is due to a misinterpretation of the meaning of the statement, "insertion into a heap takes O(log n) time". Insertion into a heap is indeed O(log n), but you have to recognise that n is the size of the heap during the insertion.
In the context of inserting n objects into a heap, the complexity of the ith insertion is O(log n_i), where n_i is the size of the heap at insertion i. Only the last insertion has a complexity of O(log n).
Let's suppose you have N elements in a heap.
Then its height would be log(N).
Now you want to insert another element; the complexity would be log(N), since in the worst case we compare all the way up to the root.
Now you have N + 1 elements and height = log(N + 1).
Using induction, it can be shown that the total cost of the insertions is Σ log(i).
Now using
log a + log b = log(ab)
this simplifies to Σ log(i) = log(n!),
which is actually O(N log N).
But
we are doing something wrong here, since in most cases we do not reach the top.
Hence, while executing, most of the time we find that we do not go even halfway up the tree. So this bound can be tightened to another, sharper bound using the mathematics given in the answers above.
This realization came to me after detailed thought and experimentation with heaps.
I really like the explanation by Jeremy West. Another approach which is really easy to understand is given here: http://courses.washington.edu/css343/zander/NotesProbs/heapcomplexity
Build-heap uses heapify with the sift-down approach, so the cost depends on the sum of the heights of all the nodes. That sum is given by
S = Σ from i = 0 to i = h of (2^i · (h − i)), where h = log n is the height of the tree.
Solving for S, we get S = 2^(h+1) − 1 − (h + 1).
Since n = 2^(h+1) − 1,
S = n − h − 1 = n − log n − 1.
So S = O(n), and the complexity of build-heap is O(n).
"The linear time bound of build Heap, can be shown by computing the sum of the heights of all the nodes in the heap, which is the maximum number of dashed lines.
For the perfect binary tree of height h containing N = 2^(h+1) – 1 nodes, the sum of the heights of the nodes is N – h – 1.
Thus it is O(N)."
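A quick numeric check of that identity (a small verification script of my own, assuming a perfect binary tree of height h with N = 2^(h+1) − 1 nodes):

```python
def sum_of_heights(h):
    # In a perfect tree of height h, level i (root = level 0)
    # holds 2^i nodes, each of height h - i.
    return sum((2 ** i) * (h - i) for i in range(h + 1))

for h in range(1, 12):
    n = 2 ** (h + 1) - 1          # node count of a perfect tree
    assert sum_of_heights(h) == n - h - 1
```

For instance at h = 2: the sum is 1·2 + 2·1 + 4·0 = 4, and N − h − 1 = 7 − 2 − 1 = 4.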
Proof of O(n)
The proof isn't fancy and is quite straightforward; I only proved the case for a full binary tree, but the result can be generalized to a complete binary tree.
We can use another optimal solution to build a heap instead of inserting each element repeatedly. It goes as follows:
Arbitrarily put the n elements into the array to respect the shape property of the heap.
Starting from the lowest level and moving upwards, sift the root of each subtree downward, as in the heapify-down process, until the heap property is restored.
Next, let's analyze the time complexity of this process. Suppose there are n elements in the heap, and the height of the heap is h. Then we have the following relationship:
When there is only one node in the last level, n = 2^h; and when the last level of the tree is completely filled, n = 2^(h+1) − 1. So in general, 2^h ≤ n < 2^(h+1).
Counting from the bottom as level 0 (so the root node is at level h), level j contains at most 2^(h−j) nodes, and each node at that level takes at most j swap operations to sift down. So in level j, the total number of operations is at most j · 2^(h−j).
So the total running time for building the heap is proportional to:
T(n) = Σ from j = 0 to h of (j · 2^(h−j))
If we factor out the 2^h term, then we get:
T(n) = 2^h · Σ from j = 0 to h of (j / 2^j)
As we know, Σ j/2^j is a series that converges to 2 (in detail, you can refer to this wiki).
Using this we have:
T(n) ≤ 2^h · 2 = 2^(h+1)
Based on the condition 2^h ≤ n, we then have:
T(n) ≤ 2n = O(n)
This proves that building a heap is a linear operation.
The question is pretty much in the title, but say I have a list L
L = [1,2,3,4,5]
min(L) = 1 here. Now I remove 4. The min is still 1. Then I remove 2. The min is still 1. Then I remove 1. The min is now 3. Then I remove 3. The min is now 5, and so on.
I am wondering if there is a good way to keep track of the min of the list at all times without needing to do min(L) or scanning through the entire list, etc.
There is an efficiency cost to actually removing the items from the list because it has to move everything else over. Re-sorting the list each time is expensive, too. Is there a way around this?
To remove a random element you need to know what elements have not been removed yet.
To know the minimum element, you need to sort or scan the items.
A min heap implemented as an array neatly solves both problems. The cost to remove an item is O(log N) and the cost to find the min is O(1). The items are stored contiguously in an array, so choosing one at random is very easy, O(1).
The min heap is described on this Wikipedia page
BTW, if the data are large, you can leave them in place and store pointers or indexes in the min heap and adjust the comparison operator accordingly.
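For the record, here is a sketch of how a min-heap with lazy deletion could look in Python (class and method names are my own; heapq provides the heap itself):

```python
import heapq
from collections import Counter

class MinTracker:
    """Min-heap with lazy deletion: remove() just marks an element,
    and current_min() pops stale entries before answering."""
    def __init__(self, items):
        self.heap = list(items)
        heapq.heapify(self.heap)          # O(n) build
        self.removed = Counter()

    def remove(self, x):
        self.removed[x] += 1              # defer the actual heap fix-up

    def current_min(self):
        # Discard heap tops that were already removed
        while self.heap and self.removed[self.heap[0]] > 0:
            self.removed[heapq.heappop(self.heap)] -= 1
        return self.heap[0]
```

Running the question's example: starting from [1, 2, 3, 4, 5], removing 4 then 2 leaves the min at 1; removing 1 makes it 3; removing 3 makes it 5.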
Google for self-balancing binary search trees. Building one from the initial list takes O(n lg n) time, and finding and removing an arbitrary item will take O(lg n) (instead of O(n) for finding/removing from a simple list). A smallest item will always appear in the root of the tree.
This question may be useful. It provides links to several implementation of various balanced binary search trees. The advice to use a hash table does not apply well to your case, since it does not address maintaining a minimum item.
Here's a solution that needs O(N lg N) preprocessing time, O(lg N) time per delete, and O(lg N · lg N) time per findMin query.
Preprocessing:
Step 1: sort L.
Step 2: for each item L[i], map L[i] -> i.
Step 3: build a Binary Indexed Tree (Fenwick tree) or segment tree where BIT[i] = 1 for every 1 <= i <= length of L, maintaining range sums.
Query type delete:
Step 1: if an item x is to be removed, find its index with a binary search on array L (where L is sorted) or from the mapping. Set BIT[index[x]] = 0 and update the ranges. Runtime: O(lg N).
Query type findMin:
Step 1: do a binary search over array L. For every mid, find the prefix sum on the BIT from 1 to mid. If it is > 0, then we know some value <= L[mid] is still alive, so we set hi = mid - 1; otherwise we set lo = mid + 1. Runtime: O(lg² N).
Same can be done with Segment tree.
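Here is one way the steps above might look in Python (a sketch: the class and method names are mine, and it assumes the values in L are distinct):

```python
import bisect

class MinWithDeletes:
    """Sort once; BIT slot i records whether the i-th smallest value is
    still alive; find_min binary-searches for the first live prefix."""
    def __init__(self, values):
        self.sorted = sorted(values)          # step 1
        self.n = len(values)
        self.tree = [0] * (self.n + 1)        # Fenwick tree, 1-based
        for i in range(1, self.n + 1):
            self._add(i, 1)                   # step 3: every slot alive

    def _add(self, i, delta):
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def _prefix(self, i):                     # sum of tree[1..i]
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def delete(self, x):                      # O(log n); assumes x is unique
        idx = bisect.bisect_left(self.sorted, x) + 1
        self._add(idx, -1)

    def find_min(self):                       # O(log^2 n)
        lo, hi = 1, self.n
        while lo < hi:
            mid = (lo + hi) // 2
            if self._prefix(mid) > 0:         # something <= sorted[mid-1] alive
                hi = mid
            else:
                lo = mid + 1
        return self.sorted[lo - 1]
```

Duplicates would need per-value counts in the mapping, but the structure is otherwise the same.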
Edit: if I'm not mistaken, each query can even be processed in O(1) with a linked list.
If sorting isn't in your best interest, only do comparisons where you need to do them. If you remove elements that are not the old minimum, and you aren't inserting any new elements, no rescan for the minimum is necessary.
Can you give us some more information about the processing going on that you are trying to do?
Comment answer: You don't have to recompute min(L) every time. Just keep track of the minimum's index, and re-run the scan for min(L) only when you remove the element at (or before) that index (making sure you update the tracked index accordingly).
Your current approach of rescanning when the minimum is removed is O(1)-time in expectation for each removal (assuming every item is equally likely to be removed).
Given a list of n items, a rescan is necessary with probability 1/n, so the expected work at each step is n * 1/n = O(1).
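A small sketch of that strategy (the function and its return format are my own illustration): remove random elements one at a time, rescanning for the minimum only when the minimum itself was removed:

```python
import random

def remove_random_tracking_min(L):
    """Remove random elements one by one, rescanning for the min
    only when the current minimum itself is removed."""
    L = L[:]
    m = min(L)                        # one initial O(n) scan
    history = []
    while L:
        x = L.pop(random.randrange(len(L)))
        if x == m and L:              # rescan only when the min was removed
            m = min(L)
        history.append((x, m if L else None))
    return history
```

Each entry of the returned history pairs the removed element with the minimum of what remains, so the bookkeeping can be checked against a brute-force recomputation.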
Hey. I have a very large array and I want to find the Nth largest value. Trivially, I could sort the array and then take the Nth element, but I'm only interested in one element, so there's probably a better way than sorting the entire array...
A heap is the best data structure for this operation and Python has an excellent built-in library to do just this, called heapq.
import heapq

def nth_largest(n, iterable):
    # nlargest returns the n largest items in descending order,
    # so the last one is the n-th largest
    return heapq.nlargest(n, iterable)[-1]
Example Usage:
>>> import random
>>> values = [random.randint(0, 1000) for i in range(100)]
>>> n = 10
>>> nth_largest(n, values)
920
Confirm result by sorting:
>>> sorted(values)[-10]
920
Sorting would require at least O(n log n) runtime, whereas there are very efficient selection algorithms which can solve your problem in linear time.
Partition-based selection (often called quickselect), which is based on the idea of quicksort (recursive partitioning), is a good solution (see link for pseudocode, plus another example).
A simple modified quicksort works very well in practice. It has average running time proportional to N (though with bad luck the worst-case running time is O(N²)).
Proceed like a quicksort. Pick a pivot value randomly, then stream through your values and see if they are above or below that pivot value and put them into two bins based on that comparison.
In quicksort you'd then recursively sort each of those two bins. But for the N-th highest value computation, you only need to sort ONE of the bins.. the population of each bin tells you which bin holds your n-th highest value. So for example if you want the 125th highest value, and you sort into two bins which have 75 in the "high" bin and 150 in the "low" bin, you can ignore the high bin and just proceed to finding the 125-75=50th highest value in the low bin alone.
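The bin-splitting idea above can be sketched in Python as follows (a hedged illustration with names of my own; this variant uses a three-way partition so duplicates of the pivot are handled, and it assumes 1 <= n <= len(values)):

```python
import random

def select_nth_largest(values, n):
    """Quickselect sketch: expected O(len(values)) time,
    O(n^2) in the worst case. n = 1 returns the maximum."""
    vals = list(values)
    k = n - 1                          # 0-based rank from the top
    while True:
        pivot = random.choice(vals)
        high = [v for v in vals if v > pivot]
        equal = [v for v in vals if v == pivot]
        low = [v for v in vals if v < pivot]
        if k < len(high):
            vals = high                # answer lives in the high bin
        elif k < len(high) + len(equal):
            return pivot               # pivot itself is the answer
        else:
            k -= len(high) + len(equal)
            vals = low                 # recurse into the low bin only
```

Only one bin is ever kept, which is exactly why this beats a full sort.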
You can iterate over the entire sequence while maintaining a list of the N largest values found so far (for constant N this is O(n)). That being said, it may just be simpler to sort the list.
You could try the median-of-medians method; its worst-case running time is O(N).
Use heapsort. It only partially orders the list until you draw the elements out.
You essentially want to produce a "top-N" list and select the one at the end of that list.
So you can scan the array once and insert into an empty list when the largeArray item is greater than the last item of your top-N list, then drop the last item.
After you finish scanning, pick the last item in your top-N list.
An example for ints and N = 5:
int[] top5 = new int[5];
Arrays.fill(top5, Integer.MIN_VALUE); // or your min value
for (int i = 0; i < largeArray.length; i++) {
    if (largeArray[i] > top5[0]) {
        // replace the smallest of the current top 5, then re-sort
        top5[0] = largeArray[i];
        Arrays.sort(top5); // ascending, so top5[0] is the smallest of the five
    }
}
// top5[0] now holds the 5th largest value
As people have said, you can walk the list once keeping track of the K largest values. If K is large, this algorithm will be close to O(n²).
However, if you store the K largest values in a binary tree (or heap), the operation becomes O(n log K).
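Sketched with Python's heapq as the bounded structure (a size-K min-heap; the helper name is mine):

```python
import heapq

def kth_largest(values, k):
    """Maintain a min-heap of the K largest seen so far: O(n log K)."""
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)   # pop the smallest, push v
    return heap[0]                        # smallest of the K largest
```

Each element costs at most one O(log K) heap operation, so the K largest never have to be kept sorted.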
According to Wikipedia, this is the best selection algorithm:
function findFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        if pivotNewIndex > k // new condition
            findFirstK(list, left, pivotNewIndex-1, k)
        if pivotNewIndex < k
            findFirstK(list, pivotNewIndex+1, right, k)
Its average complexity is O(n).
One thing you should do if this is in production code is test with samples of your data.
For example, you might consider 1000 or 10000 elements 'large' arrays, and code up a quickselect method from a recipe.
The compiled nature of sorted, and its somewhat hidden and constantly evolving optimizations, make it faster than a Python-written quickselect on small to medium sized datasets (< 1,000,000 elements). Also, you might find that as you increase the size of the array beyond that amount, memory is more efficiently handled in native code, and the benefit continues.
So even though quickselect is O(n) vs sorted's O(n log n), that doesn't take into account how many actual machine-code instructions processing each of the n elements will take, any impacts on pipelining, use of processor caches, and other things the creators and maintainers of sorted will bake into the code.
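A minimal way to run that comparison on your own data (a sketch of my own; timings vary by machine, and heapq.nlargest stands in here for a selection method):

```python
import heapq
import random
import timeit

data = [random.random() for _ in range(50_000)]
n = 10

# Time a full sort vs. a heap-based selection of the n-th largest
t_sorted = timeit.timeit(lambda: sorted(data)[-n], number=3)
t_heapq = timeit.timeit(lambda: heapq.nlargest(n, data)[-1], number=3)

print(f"sorted: {t_sorted:.4f}s  heapq.nlargest: {t_heapq:.4f}s")
# Whichever is faster, both must agree on the answer itself
assert sorted(data)[-n] == heapq.nlargest(n, data)[-1]
```

Scaling the list size up and down is what actually tells you where the crossover sits for your workload.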
You can keep two counts for each element: the number of elements bigger than it and the number of elements smaller than it.
The element for which the count of bigger elements equals N − 1 is the Nth largest (this is O(n²), but simple).
Check the solution below:
def NthHighest(l, n):
    if len(l) < n:
        return 0
    for i in range(len(l)):
        low_count = 0
        up_count = 0
        for j in range(len(l)):
            if l[j] > l[i]:
                up_count = up_count + 1
            else:
                low_count = low_count + 1
        # print(l[i], low_count, up_count)
        if up_count == n - 1:
            # print(l[i])
            return l[i]

# find the 4th largest number
l = [1, 3, 4, 9, 5, 15, 5, 13, 19, 27, 22]
print(NthHighest(l, 4))
-- using the above solution you can find both the Nth highest and the Nth lowest.
If you do not mind using pandas then:
import pandas as pd
N = 10
column_name = 0
pd.DataFrame(your_array).nlargest(N, column_name)
The above code will show you the N largest values along with the index position of each value.
Hope it helps. :-)
Pandas Nlargest Documentation