O(N) Time complexity for simple Python function - python

I just took a Codility demo test. The question and my answer can be seen here, but I'll paste my answer here as well. My response:
def solution(A):
    # write your code in Python 2.7
    retresult = 1  # the smallest integer we can return, if it is not in the array
    A.sort()
    for i in A:
        if i > 0:
            if i == retresult: retresult += 1  # increment the result since the current result exists in the array
            elif i > retresult: break  # we can go out of the loop since we found a bigger number than our current positive integer result
    return retresult
My question is about time complexity, which I hope to better understand from your response. The problem states that the expected worst-case time complexity is O(N).
Does my function have O(N) time complexity? Does the fact that I sort the array increase the complexity, and if so how?
Codility reports (for my answer)
Detected time complexity:
O(N) or O(N * log(N))
So, what is the complexity for my function? And if it is O(N*log(N)), what can I do to decrease the complexity to O(N) as the problem states?
Thanks very much!
p.s. my background reading on time complexity comes from this great post.
EDIT
Following the reply below, and the answers described here for this problem, I would like to expand on this with my take on the solutions:
basicSolution has an expensive time complexity and so is not the right answer for this Codility test:
def basicSolution(A):
    # O(N*log(N)) time complexity
    retresult = 1  # the smallest integer we can return, if it is not in the array
    A.sort()
    for i in A:
        if i > 0:
            if i == retresult: retresult += 1  # increment the result since the current result exists in the array
            elif i > retresult: break  # we can go out of the loop since we found a bigger number than our current positive integer result
        else:
            continue  # negative numbers and 0 don't need any work
    return retresult
hashSolution is my take on what is described in the above article, in the "use hashing" paragraph. As I am new to Python, please let me know if you have any improvements to this code (it does work though against my test cases), and what time complexity this has?
def hashSolution(A):
    # O(N) time complexity, I think? but requires O(N) extra space (the requirement allows O(N) space)
    table = {}
    for i in A:
        if i > 0:
            table[i] = True  # collision/duplicate will just overwrite
    for i in range(1, 100000 + 1):  # the problem says that the array has a maximum of 100,000 integers
        if not table.get(i):
            return i
    return 1  # default
Finally, the actual O(N) solution (O(N) time and O(1) extra space) I am having trouble understanding. I understand that negative/0 values are pushed to the back of the array, and then we have an array of just positive values. But I do not understand the findMissingPositive function - could anyone please describe this with Python code/comments? With an example, perhaps? I've been trying to work through it in Python and just cannot figure it out :(

It does not, because you sort A.
The Python list.sort() function uses Timsort (named after Tim Peters), and has a worst-case time complexity of O(NlogN).
Rather than sort your input, you'll have to iterate over it and determine if any integers are missing by some other means. I'd use a set of a range() object:
def solution(A):
    expected = set(range(1, len(A) + 1))
    for i in A:
        expected.discard(i)
    if not expected:
        # all consecutive digits for len(A) were present, so next is missing
        return len(A) + 1
    return min(expected)
This is O(N); we create a set of len(A) (O(N) time), then we loop over A, removing elements from expected (again O(N) time, removing elements from a set is O(1)), then test for expected being empty (O(1) time), and finally get the smallest element in expected (at most O(N) time).
So we make at most 3 O(N) passes in the above function, making it an O(N) solution.
This also fits the storage requirement: all we use is a set of at most N elements. Sets have a small per-element overhead, but the total extra space stays O(N).
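A quick sanity check of solution() (the first array below is the example commonly used for this Codility task; treat these calls as illustrative):

print(solution([1, 3, 6, 4, 1, 2]))  # -> 5
print(solution([1, 2, 3]))           # -> 4 (all of 1..len(A) are present)
print(solution([-1, -3]))            # -> 1 (no positive values at all)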
The hash solution you found is based on the same principle, except that it uses a dictionary instead of a set. Note that the dictionary values are never actually used, they are either set to True or absent. I'd rewrite that as:
def hashSolution(A):
    seen = {i for i in A if i > 0}
    if not seen:
        # there were no positive values, so 1 is the first missing.
        return 1
    for i in range(1, 10**5 + 1):
        if i not in seen:
            return i
    # we can never get here because the inputs are limited to integers up
    # to 100,000. So either `seen` has a limited number of positive values
    # below 100,000 or none at all.
The above avoids looping all the way to 100,000 if there are no positive integers in A at all.
The difference between mine and theirs is that mine starts with the set of expected numbers, while they start with the set of positive values from A, inverting the storage and test.
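As for the O(N) time / O(1) extra space findMissingPositive approach asked about at the end of the question, a rough editor's sketch of the general technique (segregate non-positive values, then use the positive part of the array itself as a presence table by flipping signs) is shown below. This is an illustration of the well-known idea, not the linked article's exact code; note that the slice used here for readability actually copies, so a truly O(1)-space version would work with index offsets into A instead.

def find_missing_positive(A):
    # Step 1: move all values <= 0 to the front; j ends up as the count of
    # non-positive values, so A[j:] holds only positive numbers.
    j = 0
    for i in range(len(A)):
        if A[i] <= 0:
            A[i], A[j] = A[j], A[i]
            j += 1
    pos = A[j:]  # slice for readability; use index offsets to keep O(1) space
    n = len(pos)
    # Step 2: for each value v with 1 <= v <= n, record "v was seen" by making
    # the element at index v-1 negative.
    for i in range(n):
        v = abs(pos[i])
        if 1 <= v <= n:
            pos[v - 1] = -abs(pos[v - 1])
    # Step 3: the first index that stayed positive was never marked,
    # so index + 1 is the smallest missing positive integer.
    for i in range(n):
        if pos[i] > 0:
            return i + 1
    return n + 1

print(find_missing_positive([3, 4, -1, 1]))  # -> 2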

Related

Trying to calculate algorithm time complexity

So last night I solved this LeetCode question. My solution is not great, quite slow. So I'm trying to calculate the complexity of my algorithm to compare with standard algorithms that LeetCode lists in the Solution section. Here's my solution:
from typing import List

class Solution:
    def longestCommonPrefix(self, strs: List[str]) -> str:
        # Get the lengths of all strings in the list and take the minimum,
        # since the common prefix can't be longer than the shortest string.
        # Catch ValueError if the list is empty.
        try:
            min_len = min(len(i) for i in strs)
        except ValueError:
            return ''
        # Split strings into sets character-wise.
        foo = [set(list(zip(*strs))[k]) for k in range(min_len)]
        # Now go through the resulting list and check whether the resulting sets
        # have a length of 1. If true, add those characters to the prefix list.
        # Break as soon as we encounter a set of length > 1.
        prefix = []
        for i in foo:
            if len(i) == 1:
                x, = i
                prefix.append(x)
            else:
                break
        common_prefix = ''.join(prefix)
        return common_prefix
I'm struggling a bit with calculating the complexity. The first step - getting the minimum length of the strings - takes O(n), where n is the number of strings in the list. Then the last step is also easy - it should take O(m), where m is the length of the shortest string.
But the middle bit is confusing. set(list(zip(*strs))) should hopefully take O(m) again, and then we do it n times, so O(mn). But then the overall complexity is O(mn + m + n), which seems way too low for how slow the solution is.
The other option is that the middle step is O(m^2*n), which makes a bit more sense. What is the proper way to calculate complexity here?
Yes, the middle portion is O(mn), and the overall complexity is O(mn) as well, because that term dwarfs the O(m) and O(n) terms for large values of m and n.
Your solution has an ideal order of runtime complexity.
Optimize: Short-Circuit
However, you are probably dismayed that others have faster solutions. I suspect that others likely short-circuit on the first non-matching index.
Let's consider a test case of 26 strings (['a'*500, 'b'*500, 'c'*500, ...]). Your solution would proceed to create a list that is 500 long, with each entry containing a set of 26 elements. Meanwhile, if you short-circuited, you would only process the first index, ie one set of 26 characters.
Try changing your list into a generator. This might be all you need to short-circuit.
foo = (set(x) for x in zip(*strs))
You can also skip the min_len check, because the default behaviour of zip is to iterate only as long as the shortest input.
Optimize: Generating Intermediate Results
I see that you append each letter to a list, then ''.join(lst). This is efficient, especially compared to the alternative of iteratively appending to a string.
However, we could just as easily keep a counter match_len. Then, when we detect the first mismatch, just:
return strs[0][:match_len]
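Putting both optimisations together, a short-circuiting version might look roughly like this (an illustrative sketch that keeps the LeetCode signature; zip stops at the shortest string, and the loop stops at the first mismatching column):

from typing import List

class Solution:
    def longestCommonPrefix(self, strs: List[str]) -> str:
        if not strs:
            return ''
        match_len = 0
        for column in zip(*strs):        # one tuple of characters per index
            if len(set(column)) > 1:     # first mismatch: stop immediately
                break
            match_len += 1
        return strs[0][:match_len]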

how to calculate the minimum unfairness sum of a list

I have tried to summarize the problem statement like this:
Given n, k, and an array (a list) arr, where n = len(arr) and k is an integer in the range [1, n] inclusive.
For an array (or list) myList, the Unfairness Sum is defined as the sum of the absolute differences between all possible pairs (combinations with 2 elements each) in myList.
To explain: if myList = [1, 2, 5, 5, 6], then the unfairness sum (written MUS below) is computed as follows. Please note that elements are considered unique by their index in the list, not by their values:
MUS = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-6| + |5-6|
If you actually need to look at the problem statement, It's HERE
My Objective
Given n, k, and arr (as described above), find the minimum unfairness sum over all possible sub arrays, subject to the constraint that each sub array has len(sub array) = k [which is a good thing to make our lives easy, I believe :) ]
what I have tried
well, there is a lot to be added in here, so I'll try to be as short as I can.
My first approach was the one below, where I used itertools.combinations to get all the possible combinations and statistics.variance to check the spread of the data (yeah, I know I'm a mess).
Before you see the code below: do you think variance and unfairness sum are perfectly related (I know they are strongly related), i.e. does the sub array with minimum variance have to be the sub array with the MUS?
You only have to check the LetMeDoIt(n, k, arr) function. If you need an MCVE, check the second code snippet below.
from itertools import combinations as cmb
from statistics import variance as varn

def LetMeDoIt(n, k, arr):
    v = []
    s = []
    subs = [list(x) for x in list(cmb(arr, k))]  # getting all sub arrays from arr in a list
    i = 0
    for sub in subs:
        if i != 0:
            var = varn(sub)  # the variance thingy
            if float(var) < float(min(v)):
                v.remove(v[0])
                v.append(var)
                s.remove(s[0])
                s.append(sub)
            else:
                pass
        elif i == 0:
            var = varn(sub)
            v.append(var)
            s.append(sub)
            i = 1
    final = []
    f = list(cmb(s[0], 2))  # getting list of all pairs (after determining sub array with least MUS)
    for r in f:
        final.append(abs(r[0]-r[1]))  # calculating the MUS in my messy way
    return sum(final)
The above code works fine for n<30 but raised a MemoryError beyond that.
In Python chat, Kevin suggested I try a generator, which is memory efficient (it really is), but since a generator also produces those combinations on the fly as we iterate over them, it was estimated to take over 140 hours (:/) for n=50, k=8.
I posted the same as a question on SO HERE (you might wanna have a look to understand me properly - it has discussions and an answer by fusion which took me to my second approach - a better one (I should say fusion's approach xD)).
Second Approach
from itertools import combinations as cmb

def myvar(arr):  # a function to calculate variance
    l = len(arr)
    m = sum(arr)/l
    return sum((i-m)**2 for i in arr)/l

def LetMeDoIt(n, k, arr):
    sorted_list = sorted(arr)  # i think sorting the array makes it easy to get the sub array with MUS quickly
    variance = None
    min_variance_sub = None
    for i in range(n - k + 1):
        sub = sorted_list[i:i+k]
        var = myvar(sub)
        if variance is None or var < variance:
            variance = var
            min_variance_sub = sub
    final = []
    f = list(cmb(min_variance_sub, 2))  # again getting all possible pairs in my messy way
    for r in f:
        final.append(abs(r[0] - r[1]))
    return sum(final)

def MainApp():
    n = int(input())
    k = int(input())
    arr = list(int(input()) for _ in range(n))
    result = LetMeDoIt(n, k, arr)
    print(result)

if __name__ == '__main__':
    MainApp()
This code works perfectly for n up to 1000 (maybe more), but terminates due to a time out (5 seconds is the limit on the online judge :/ ) for n beyond 10000 (the biggest test case has n=100000).
=====
How would you approach this problem to handle all the test cases within the given time limit (5 sec)? (The problem was listed under algorithms & dynamic programming.)
(For your reference, you can have a look at:
successful submissions (py3, py2, C++, Java) on this problem by other candidates - so that you can explain that approach for me and future visitors,
an editorial by the problem setter explaining how to approach the question,
a solution code by the problem setter himself (py2, C++),
and the input data (test cases) and expected output.)
Edit 1:
For future visitors of this question, the conclusions I have so far are:
Variance and unfairness sum are not perfectly related (they are strongly related), which implies that among a lot of lists of integers, the list with minimum variance doesn't always have to be the list with minimum unfairness sum. If you want to know why, I asked that as a separate question on Math Stack Exchange HERE, where one of the mathematicians proved it for me xD (and it's worth taking a look, 'cause it was unexpected).
As far as the question overall is concerned, you can read the answers by archer & Attersson below (I'm still trying to figure out a naive approach to carry this out - it shouldn't be far off by now, though).
Thank you for any help or suggestions :)
You must work on your list SORTED and check only sublists of consecutive elements. This is because any sublist that includes at least one non-consecutive element will necessarily have a higher unfairness sum.
For example, if the list is
[1,3,7,10,20,35,100,250,2000,5000] and you want to check sublists of length 3, then the solution must be one of [1,3,7], [3,7,10], [7,10,20], etc.
Any other sublist, e.g. [1,3,10], will have a higher unfairness sum, because 10 > 7 and therefore all of its differences with the rest of the elements will be larger than the corresponding differences with 7.
The same holds for [1,7,10] (non-consecutive on the left side), since 1 < 3.
Given that, you only have to check consecutive sublists of length k, which reduces the execution time significantly.
Regarding coding, something like this should work:
import itertools

def myvar(array):
    return sum([abs(i[0]-i[1]) for i in itertools.combinations(array, 2)])

def minsum(n, k, arr):
    arr = sorted(arr)  # the list must be sorted, as explained above
    res = 10**21  # alternatively make it equal to the first subarray's sum
    for i in range(n - k + 1):  # there are n-k+1 consecutive windows of length k
        res = min(res, myvar(arr[i:i+k]))
    return res
I see this question still has no complete answer, so I will sketch the track of a correct algorithm that will pass the judge. I will not write the full code, in order to respect the purpose of the HackerRank challenge, since we already have working solutions.
1. The original array must be sorted. This has a complexity of O(N log N).
2. At this point you can check consecutive subarrays only, as non-consecutive ones will result in a worse (or equal, but not better) "unfairness sum". This is also explained in archer's answer.
3. The last pass, finding the minimum "unfairness sum", can be done in O(N). You need to calculate the US for every consecutive k-long subarray. The mistake is recalculating it from scratch at every step, which takes O(k) and brings the complexity of this pass to O(k*N). Each update can instead be done in O(1), as the editorial you posted shows, including the mathematical formulae. It requires a prior initialization of a cumulative (prefix-sum) array after step 1 (done in O(N), with O(N) space complexity too).
It works but terminates due to time out for n<=10000.
(from comments on archer's question)
To explain step 3, think about k = 100. You are scrolling through the N-long array, and on the first iteration you must calculate the US for the subarray from element 0 to 99 as usual, requiring 100 operations. The next step requires you to calculate the same for a subarray that differs from the previous one by only one element: elements 1 to 100. Then 2 to 101, and so on.
If it helps, think of it like a snake. One block is removed and one is added.
There is no need to perform the whole O(k) recomputation. Just work out the maths as explained in the editorial and you can do each update in O(1).
So the final complexity will asymptotically be O(NlogN) due to the first sort.
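To make step 3 concrete, here is a minimal editor's sketch of the sliding-window update (my own illustration of the idea, not the editorial's code and not a polished submission). It uses a running window sum rather than a full prefix-sum array; in a sorted window of length k, the element at index i contributes with coefficient 2*i - (k - 1), which gives the O(k) formula for the first window, and each shift is then updated in O(1):

def min_unfairness_sum(n, k, arr):
    a = sorted(arr)                                   # step 1: O(N log N)
    # Unfairness sum of the first window, computed directly once in O(k):
    # in a sorted window w, US = sum(w[i] * (2*i - (k - 1))).
    us = sum(a[i] * (2 * i - (k - 1)) for i in range(k))
    window_sum = sum(a[:k])
    best = us
    for i in range(n - k):                            # slide the window: O(N) shifts
        x, y = a[i], a[i + k]                         # element leaving / element entering
        us -= window_sum - k * x                      # drop x's pairs (x was the window minimum)
        window_sum -= x
        us += (k - 1) * y - window_sum                # add y's pairs (y is the new maximum)
        window_sum += y
        best = min(best, us)
    return best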

What is the difference in time complexity between these two blocks of code (if any) and why?

Trying to solidify my knowledge about Time Complexity. I think I know the answer to this, but would like to hear some good explanations.
main = []
while len(main) < 5:
    sub = []
    while len(sub) < 5:
        sub.append(random.randint(1,10))
    main.append(sub)
VS
main = []
sub = []
while len(main) < 5:
    sub.append(random.randint(1,10))
    if len(sub) == 5:
        main.append(list(sub))
        sub = []
There's no difference, since the time complexity is constant in both cases - you perform a constant amount of operations both times.
The time complexity of both is O(1) - constant time - because they both perform a constant number of operations, as Yakov Dan already stated.
This is because time complexity is usually expressed as a function of a variable (say n) and shows how changing the value of n changes the time the algorithm takes.
Now, assuming you had n instead of 5, you would have O(n^2) in both cases. It may be tricky for the second case, since a basic way of eyeballing polynomial complexity is to count the number of nested loops, and that can lead you to conclude that the second version is O(n) because it has a single loop.
However, looking at it carefully shows that the loop runs n (5 in this case) times to fill sub for each sublist appended to main, so it is essentially the same.
This of course assumes that the built-in list.append is atomic, or runs in constant time.
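To make that concrete, here is a small counting experiment (with hypothetical helper names) that generalises both snippets from 5 to n and counts loop iterations; both perform n * n appends:

import random

def nested_version(n):
    main, iterations = [], 0
    while len(main) < n:
        sub = []
        while len(sub) < n:
            sub.append(random.randint(1, 10))
            iterations += 1
        main.append(sub)
    return iterations

def flat_version(n):
    main, sub, iterations = [], [], 0
    while len(main) < n:
        sub.append(random.randint(1, 10))
        iterations += 1
        if len(sub) == n:
            main.append(list(sub))
            sub = []
    return iterations

print(nested_version(7), flat_version(7))  # both print 49, i.e. n * n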

What is optimal algorithm to check if a given integer is equal to sum of two elements of an int array?

import numpy as np

def check_set(S, k):
    S2 = k - S
    set_from_S2 = set(S2.flatten())
    for x in S:
        if x in set_from_S2:
            return True
    return False
I have a given integer k. I want to check if k is equal to the sum of two elements of the array S.
S = np.array([1,2,3,4])
k = 8
It should return False in this case because there are no two elements of S with a sum of 8. The above code treats it as 8 = 4 + 4, so it returned True.
I can't find an algorithm to solve this problem with a complexity of O(n).
Can someone help me?
You have to account for multiple instances of the same item, so a set is not a good choice here.
Instead you can use a dictionary whose values count the occurrences of each key (or, as a variant, collections.Counter):
A = [3,1,2,3,4]
Cntr = {}
for x in A:
    if x in Cntr:
        Cntr[x] += 1
    else:
        Cntr[x] = 1

#k = 11
k = 8
ans = False
for x in A:
    if (k-x) in Cntr:
        if k == 2 * x:
            if Cntr[k-x] > 1:
                ans = True
                break
        else:
            ans = True
            break
print(ans)
Returns True for k=5,6 (I added one more 3) and False for k=8,11
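For reference, the Counter variant mentioned above could look like this (an illustrative sketch with a hypothetical function name, same O(n) idea):

from collections import Counter

def has_pair_sum(A, k):
    cnt = Counter(A)                 # value -> number of occurrences, built in O(n)
    for x in A:
        y = k - x
        if y in cnt:
            # a pair needs two distinct positions, so if y == x the value
            # must occur at least twice
            if y != x or cnt[x] > 1:
                return True
    return False

print(has_pair_sum([3, 1, 2, 3, 4], 6))  # True  (3 + 3, two distinct threes)
print(has_pair_sum([1, 2, 3, 4], 8))     # False (would need 4 twice)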
Adding onto MBo's answer.
"Optimal" can be an ambiguous term in terms of algorithmics, as there is often a compromise between how fast the algorithm runs and how memory-efficient it is. Sometimes we may also be interested in either worst-case resource consumption or in average resource consumption. We'll loop at worst-case here because it's simpler and roughly equivalent to average in our scenario.
Let's call n the length of our array, and let's consider 3 examples.
Example 1
We start with a very naive algorithm for our problem, with two nested loops that iterate over the array, and check for every two items of different indices if they sum to the target number.
Time complexity: the worst-case scenario (where the answer is False, or where it's True but we only find it on the last pair of items we check) has n^2 loop iterations. If you're familiar with big-O notation, we say the algorithm's time complexity is O(n^2), which basically means that in terms of our input size n, the time it takes to solve the problem grows more or less like n^2 up to a multiplicative factor (well, technically the notation means "at most like n^2 up to a multiplicative factor", but it's a generalized abuse of language to use it as "more or less like" instead).
Space complexity (memory consumption): we only store an array, plus a fixed set of objects whose sizes do not depend on n (everything Python needs to run, the call stack, maybe two iterators and/or some temporary variables). The part of the memory consumption that grows with n is therefore just the size of the array, which is n times the amount of memory required to store an integer in an array (let's call that sizeof(int)).
Conclusion: Time is O(n^2), Memory is n*sizeof(int) (+O(1), that is, up to an additional constant factor, which doesn't matter to us, and which we'll ignore from now on).
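For reference, a sketch of the naive algorithm described in Example 1 (hypothetical function name):

def has_pair_sum_naive(S, k):
    n = len(S)
    for i in range(n):                  # outer loop: n iterations
        for j in range(i + 1, n):       # inner loop: pairs of distinct indices
            if S[i] + S[j] == k:
                return True
    return False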
Example 2
Let's consider the algorithm in MBo's answer.
Time complexity: much, much better than in Example 1. We start by creating a dictionary. This is done in a loop over n. Setting keys in a dictionary is a constant-time operation under proper conditions, so the time taken by each step of that first loop does not depend on n. Therefore, so far we've used O(n) in terms of time complexity. Now we only have one remaining loop over n. The time spent accessing elements of our dictionary is independent of n, so once again the total complexity is O(n). Combining our two loops: since they both grow like n up to a multiplicative factor, so does their sum (up to a different multiplicative factor). Total: O(n).
Memory: Basically the same as before, plus a dictionary of n elements. For the sake of simplicity, let's consider that these elements are integers (we could have used booleans), and forget about some of the aspects of dictionaries to only count the size used to store the keys and the values. There are n integer keys and n integer values to store, which uses 2*n*sizeof(int) in terms of memory. Add to that what we had before and we have a total of 3*n*sizeof(int).
Conclusion: Time is O(n), Memory is 3*n*sizeof(int). The algorithm is considerably faster when n grows, but uses three times more memory than example 1. In some weird scenarios where almost no memory is available (embedded systems maybe), this 3*n*sizeof(int) might simply be too much, and you might not be able to use this algorithm (admittedly, it's probably never going to be a real issue).
Example 3
Can we find a trade-off between Example 1 and Example 2?
One way to do that is to replicate the same kind of nested loop structure as in Example 1, but with some pre-processing to replace the inner loop with something faster. To do that, we sort the initial array, in place. Done with well-chosen algorithms, this has a time-complexity of O(n*log(n)) and negligible memory usage.
Once we have sorted our array, we write our outer loop (which is a regular loop over the whole array), and then inside that outer loop, use dichotomy to search for the number we're missing to reach our target k. This dichotomy approach would have a memory consumption of O(log(n)), and its time complexity would be O(log(n)) as well.
Time complexity: The pre-processing sort is O(n*log(n)). Then in the main part of the algorithm, we have n calls to our O(log(n)) dichotomy search, which totals to O(n*log(n)). So, overall, O(n*log(n)).
Memory: Ignoring the constant parts, we have the memory for our array (n*sizeof(int)) plus the memory for our call stack in the dichotomy search (O(log(n))). Total: n*sizeof(int) + O(log(n)).
Conclusion: Time is O(n*log(n)), Memory is n*sizeof(int) + O(log(n)). Memory is almost as small as in Example 1. Time complexity is slightly more than in Example 2. In scenarios where the Example 2 cannot be used because we lack memory, the next best thing in terms of speed would realistically be Example 3, which is almost as fast as Example 2 and probably has enough room to run if the very slow Example 1 does.
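A sketch of Example 3 (hypothetical function name), using the standard-library bisect module for the binary search; note that bisect is iterative, so in practice its extra memory is O(1) rather than the O(log(n)) call stack a recursive dichotomy would use:

from bisect import bisect_left

def has_pair_sum_sorted(S, k):
    S.sort()                          # pre-processing: O(n*log(n)), in place
    n = len(S)
    for i, x in enumerate(S):         # n iterations...
        j = bisect_left(S, k - x)     # ...each with an O(log(n)) binary search
        if j == i:                    # don't pair an element with itself;
            j += 1                    # a duplicate, if any, sits right after it
        if j < n and S[j] == k - x:
            return True
    return False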
Overall conclusion
This answer was just to show that "optimal" is context-dependent in algorithmics. It's very unlikely that in this particular example, one would choose to implement Example 3. In general, you'd see either Example 1 if n is so small that one would choose whatever is simplest to design and fastest to code, or Example 2 if n is a bit larger and we want speed. But if you look at the wikipedia page I linked for sorting algorithms, you'll see that none of them is best at everything. They all have scenarios where they could be replaced with something better.

Big O notation of simple anagram function

I've worked up the following code, which finds anagrams. I had thought the big O notation for this was O(n), but was informed by my instructor that I am incorrect. I am confused about why this is not correct, however; would anyone be able to offer any advice?
# Define an anagram.
def anagram(s1, s2):
    return sorted(s1) == sorted(s2)

# Main function.
def Question1(t, s):
    # use built in any function to check any anagram of t is substring of s
    return any(anagram(s[i: i+len(t)], t)
               for i in range(len(s)-len(t)+ 1))
Function Call:
# Simple test case.
print Question1("app", "paple")
# True
any anagram of t is substring of s
That's not what your code says.
You have "any substring of s is an anagram of t", which might be equivalent, but it's easier to understand that way.
As for complexity, you need to define what you're calling N... Is it len(s) - len(t) + 1?
The function any() has complexity N in that case, yes.
However, you've additionally called anagram over an input of length T, and you seem to have ignored that.
anagram calls sorted twice. Each call to sorted is closer to O(T * log(T)) itself assuming merge sort. You're also performing a list slice, so it could be slightly higher.
Let's say your complexity is somewhere on the order of (S-T) * 2 * (T * log(T)) where T and S are lengths of strings.
The answer depends on which string of your input is larger.
Best case is that they are the same length because then your range only has one element.
Big O notation is worst case, though, so you need to figure out which conditions generate the most complexity in terms of total operations. For example, what if T > S? Then len(s)-len(t)+ 1 will be non positive, so does the code run more or less than equal length strings? And what about S < T or S = 0?
This is not O(N) complexity, due to a few factors. First, sorted has O(n log n) complexity. And you potentially call it several times (sorting slices of both t and s), if t is long enough.
