Find longest quasi-constant sub-sequence of a sequence - python

I had this test earlier today, and I tried to be too clever and hit a road block. Unfortunately I got stuck in this mental rut and wasted too much time, failing this portion of the test. I solved it afterward, but maybe y'all can help me get out of the initial rut I was in.
Problem definition:
An unordered and non-unique sequence A consisting of N integers (all positive) is given. A subsequence of A is any sequence obtained by removing none, some or all elements from A. The amplitude of a sequence is the difference between the largest and the smallest element in this sequence. The amplitude of the empty subsequence is assumed to be 0.
For example, consider the sequence A consisting of six elements such that:
A[0] = 1
A[1] = 7
A[2] = 6
A[3] = 2
A[4] = 6
A[5] = 4
A subsequence of array A is called quasi-constant if its amplitude does not exceed 1. In the example above, the subsequences [1,2], [6,6], and [6,6,7] are quasi-constant. Subsequence [6, 6, 7] is the longest possible quasi-constant subsequence of A.
Now, find a solution that, given a non-empty zero-indexed array A consisting of N integers, returns the length of the longest quasi-constant subsequence of array A. For example, given sequence A outlined above, the function should return 3, as explained.
Now, I solved this in python 3.6 after the fact using a sort-based method with no classes (my code is below), but I didn't initially want to do that as sorting on large lists can be very slow. It seemed this should have a relatively simple formulation as a breadth-first tree-based class, but I couldn't get it right. Any thoughts on this?
My class-less sort-based solution:
def amp(sub_list):
    if len(sub_list) < 2:
        return 0
    else:
        return max(sub_list) - min(sub_list)

def solution(A):
    A.sort()
    longest = 0
    idxStart = 0
    idxEnd = idxStart + 1
    while idxEnd <= len(A):
        tmp = A[idxStart:idxEnd]
        if amp(tmp) < 2:
            idxEnd += 1
            if len(tmp) > longest:
                longest = len(tmp)
        else:
            # Shrink the window from the left; jumping idxStart to idxEnd
            # would miss runs such as [2, 2, 3, 3, 3] in [1, 2, 2, 3, 3, 3].
            idxStart += 1
    return longest

As Andrey Tyukin pointed out, you can solve this problem in O(n) time, which is better than the O(n log n) time you'd likely get from either sorting or any kind of tree based solution. The trick is to use dictionaries to count the number of occurrences of each number in the input, and use the count to figure out the longest subsequence.
I had a similar idea to his, but I had thought of a slightly different implementation. After a little testing, it looks like my approach is quite a bit faster, so I'm posting it as my own answer. It's quite short!
from collections import Counter

def solution(seq):
    if not seq:  # special case for empty input sequence
        return 0
    counts = Counter(seq)
    return max(counts[x] + counts[x+1] for x in counts)
I suspect this is faster than Andrey's solution because both of our solutions really take O(n) + O(k) time, where k is the number of distinct values in the input (and n is the total number of values in the input). My code handles the O(n) part very efficiently by handing the sequence off to the Counter constructor, which is implemented in C. It is likely to be a bit slower (on a per-item basis) on the O(k) part, since that needs a generator expression. Andrey's code does the reverse (it runs slower Python code for the O(n) part, and uses faster builtin C functions for the O(k) part). Since k is always less than or equal to n (perhaps a lot less if the sequence has many repeated values), my code is faster overall. Both solutions are still O(n) though, and both should be much better than sorting for large inputs.

I don't know how BFS is supposed to help here.
Why not simply run once through the sequence and count how many elements every possible quasi-constant subsequence would have?
from collections import defaultdict

def longestQuasiConstantSubseqLength(seq):
    d = defaultdict(int)
    for s in seq:
        d[s] += 1
        d[s+1] += 1
    return max(d.values() or [0])

s = [1,7,6,2,6,4]
print(longestQuasiConstantSubseqLength(s))
prints:
3
as expected.
Explanation: Every non-constant quasi-constant subsequence is uniquely identified by the greatest number that it contains (it can contain only two distinct values; take the greater one). Now, a number s can contribute either to the quasi-constant subsequence that has s as its greatest number or to the one that has s + 1 as its greatest number. So, just add +1 to the counts of the subsequences identified by s and by s + 1. Then output the maximum of all counts.
You can't get it faster than O(n), because you have to look at every entry of the input sequence at least once.
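As a sanity check on the counting idea, here is a small sketch that compares a Counter-based count against a slow reference implementation on random inputs. The names longest_quasi_constant and brute_force are made up for this illustration, and it assumes integer input as in the problem statement.

```python
import random
from collections import Counter

def longest_quasi_constant(seq):
    """O(n) counting approach: best value of count(x) + count(x + 1)."""
    counts = Counter(seq)
    # default=0 covers the empty sequence; a Counter returns 0 for missing keys
    return max((counts[x] + counts[x + 1] for x in counts), default=0)

def brute_force(seq):
    """Slow reference: for each value lo, count the elements in {lo, lo + 1}."""
    return max((sum(1 for y in seq if y in (lo, lo + 1)) for lo in set(seq)),
               default=0)

# Compare both on random inputs, including the empty list
for _ in range(200):
    data = [random.randint(1, 10) for _ in range(random.randint(0, 30))]
    assert longest_quasi_constant(data) == brute_force(data)

print(longest_quasi_constant([1, 7, 6, 2, 6, 4]))  # 3, as in the question's example
```

The reference version is deliberately naive; it is only there to cross-check the fast one.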


Find all combinations of positive integers in increasing order that adds up to a given positive number n

How to write a function that takes n (where n > 0) and returns the list of all combinations of positive integers that sum to n?
This is a common question on the web. And there are different answers provided such as 1, 2 and 3. However, in the answers provided, they use two functions to solve the problem. I want to do it with only one single function. Therefore, I coded as follows:
def all_combinations_sum_to_n(n):
    from itertools import combinations_with_replacement
    combinations_list = []
    if n < 1:
        return combinations_list
    l = [i for i in range(1, n + 1)]
    for i in range(1, n + 1):
        combinations_list = combinations_list + (list(combinations_with_replacement(l, i)))
    result = [list(i) for i in combinations_list if sum(i) == n]
    result.sort()
    return result
If I pass 20 to my function, i.e. all_combinations_sum_to_n(20), the OS of my machine kills the process because it is so costly. I think the space complexity of my function is O(n*n!). How do I modify my code so that I don't have to create any other function and yet my single function has an improved time or space complexity? I don't think it is possible by using itertools.combinations_with_replacement.
UPDATE
All answers provided by Barmar, ShadowRanger and pts are great. As I was looking for an efficient answer in terms of both memory and runtime, I used https://perfpy.com and selected python 3.8 to compare the answers. I used six different values of n and in all cases, ShadowRanger's solution had the highest score. Therefore, I chose ShadowRanger's answer as the best one. The scores were as follows:
You've got two main problems, one causing your current problem (out of memory) and one that will continue the problem even if you solve that one:
You're accumulating all combinations before filtering, so your memory requirements are immense. You don't even need a single list if your function can be a generator (that is iterated to produce a value at a time) rather than returning a fully realized list, and even if you must return a list, you needn't generate such huge intermediate lists. You might think you need at least one list for sorting purposes, but combinations_with_replacement is already guaranteed to produce a predictable order based on the input ordering, and since range is ordered, the values produced will be ordered as well.
Even if you solve the memory problem, the computational cost of just generating that many combinations is prohibitive, due to poor scaling; for the memory, but not CPU, optimized version of the code below, it handles an input of 11 in 0.2 seconds, 12 in ~2.6 seconds, and 13 in ~11 seconds; at that scaling rate, 20 is going to approach heat death of the universe timeframes.
Barmar's answer is one solution to both problems, but it's still doing work eagerly and storing the complete results when the complete work might not even be needed, and it involves sorting and deduplication, which aren't strictly necessary if you're sufficiently careful about how you generate the results.
This answer will fix both problems, first the memory issue, then the speed issue, without ever needing memory requirements above linear in n.
Solving the memory issue alone actually makes for simpler code that still uses your same basic strategy, but without consuming all your RAM needlessly. The trick is to write a generator function, which avoids storing more than one result at a time (the caller can listify if they know the output is small enough and they actually need it all at once, but typically, just looping over the generator is better):
from collections import deque  # Cheap way to just print the last few elements
from itertools import combinations_with_replacement  # Imports should be at top of file,
                                                     # not repeated per call
def all_combinations_sum_to_n(n):
    for i in range(1, n + 1):  # For each possible length of combinations...
        # For each combination of that length...
        for comb in combinations_with_replacement(range(1, n + 1), i):
            if sum(comb) == n:    # If the sum matches...
                yield list(comb)  # yield the combination

# 13 is the largest input that will complete in some vaguely reasonable time, ~10 seconds on TIO
print(*deque(all_combinations_sum_to_n(13), maxlen=10), sep='\n')
Again, to be clear, this will not complete in any reasonable amount of time for an input of 20; there's just too much redundant work being done, and the growth pattern for combinations scales with the factorial of the input; you must be more clever algorithmically. But for less intense problems, this pattern is simpler, faster, and dramatically more memory-efficient than a solution that builds up enormous lists and concatenates them.
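To get a feel for how much work the generate-then-filter approach does, the number of tuples that combinations_with_replacement yields can be counted without generating them. This is only a sketch: it relies on the standard formula C(n + i - 1, i) for the number of size-i multisets over n items, the helper name candidates_examined is an assumption of this example, and math.comb needs Python 3.8+.

```python
from math import comb

def candidates_examined(n):
    """Number of tuples the generate-then-filter approach inspects for input n."""
    # For each length i, combinations_with_replacement over n values
    # yields C(n + i - 1, i) tuples.
    return sum(comb(n + i - 1, i) for i in range(1, n + 1))

for n in (5, 10, 13, 20):
    print(n, candidates_examined(n))
```

Every one of those tuples gets summed and tested, even though only a small fraction of them add up to n.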
To solve in a reasonable period of time, using the same generator-based approach (but without itertools, which isn't practical here because you can't tell it to skip over combinations when you know they're useless), here's an adaptation of Barmar's answer that requires no sorting, produces no duplicates, and as a result, can produce the solution set in less than a 20th of a second, even for n = 20:
def all_combinations_sum_to_n(n, *, max_seen=1):
    for i in range(max_seen, n // 2 + 1):
        for subtup in all_combinations_sum_to_n(n - i, max_seen=i):
            yield (i,) + subtup
    yield (n,)

for x in all_combinations_sum_to_n(20):
    print(x)
That not only produces the individual tuples with internally sorted order (1 is always before 2), but produces the sequence of tuples in sorted order (so looping over sorted(all_combinations_sum_to_n(20)) is equivalent to looping over all_combinations_sum_to_n(20) directly, the latter just avoids the temporary list and a no-op sorting pass).
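Because the result is a generator, the caller decides how much of it to consume. A minimal sketch of that, with the same recursion repeated under the hypothetical name partitions so the snippet is self-contained:

```python
from itertools import islice

def partitions(n, *, max_seen=1):
    """Yield each way to write n as a non-decreasing tuple of positive ints."""
    for i in range(max_seen, n // 2 + 1):
        for rest in partitions(n - i, max_seen=i):
            yield (i,) + rest
    yield (n,)

# Take only the first three results; the remaining ones are never computed
print(list(islice(partitions(20), 3)))

# Count all results without ever holding them in a list
print(sum(1 for _ in partitions(20)))  # 627
```

Wrapping the call in list() gives the fully realized list when it really is needed.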
Use recursion instead of generating all combinations and then filtering them.
def all_combinations_sum_to_n(n):
    combinations_set = set()
    for i in range(1, n):
        for sublist in all_combinations_sum_to_n(n-i):
            combinations_set.add(tuple(sorted((i,) + sublist)))
    combinations_set.add((n,))
    return sorted(combinations_set)
I had a simpler solution that didn't use sorted() and put the results in a list, but it would produce duplicates that just differed in order, e.g. [1, 1, 2] and [1, 2, 1] when n == 4. I added those to get rid of duplicates.
On my MacBook M1 all_combinations_sum_to_n(20) completes in about 0.5 seconds.
Here is a fast iterative solution:
def csum(n):
    s = [[None] * (k + 1) for k in range(n + 1)]
    for k in range(1, n + 1):
        for m in range(k, 0, -1):
            s[k][m] = [[f] + terms
                       for f in range(m, (k >> 1) + 1) for terms in s[k - f][f]]
            s[k][m].append([k])
    return s[n][1]

import sys
n = 5
if len(sys.argv) > 1: n = int(sys.argv[1])
for terms in csum(n):
    print(' '.join(map(str, terms)))
Explanation:
Let's define terms as a non-empty, non-decreasing (it can contain the same value multiple times) list of positive integers.
The solution for n is a list of all terms in increasing lexicographical order, where the sum of each terms is n.
s[k][m] is a list of all terms in increasing lexicographical order, where the sum of each terms is k, and the first (smallest) integer in each terms is at least m.
The solution is s[n][1]. Before returning this solution, the csum function populates the s array using iterative dynamic programming.
In the inner loop, the following recursion is used: each terms in s[k][m] either has at least 2 elements (f and the rest) or it has 1 element (k). In the former case, the rest is a terms whose sum is k - f and whose smallest integer is at least f, thus it is from s[k - f][f].
This solution is a lot faster than @Barmar's solution if n is at least 20. For example, on my Linux laptop for n = 25, it's about 400 times faster, and for n = 28, it's about 3700 times faster. For larger values of n, the gap widens really quickly.
This solution uses more memory than @ShadowRanger's solution, because this solution creates lots of temporary lists, and it keeps all of them until the very end.
How to come up with such a fast solution?
Try to find a recursive formula. (Don't write code yet!)
Sometimes recursion works only with recursive functions with multiple variables. (In our case, s[k][m] is the recursive function, k is the obvious variable implied by the problem, and m is the extra variable we had to invent.) So try to add a few variables to the recursive formula: add the minimum number of variables to make it work.
Write your code so that it computes each recursive function value exactly once (not more). For that, you may add a cache (memoization) to your recursive function, or you may use dynamic programming, i.e. populating a (multidimensional) array in the correct order, so that what is needed is already populated.
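The memoization route mentioned in the last step can be sketched with functools.lru_cache. To keep it short, this version only counts the terms instead of listing them; it uses the same two-variable recursion as s[k][m], and the name count_terms is made up for this example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_terms(k, m=1):
    """Number of non-decreasing lists of integers >= m that sum to k."""
    # Either the list is just [k], or it starts with some f in [m, k // 2]
    # followed by a list that sums to k - f and whose elements are all >= f.
    return 1 + sum(count_terms(k - f, f) for f in range(m, k // 2 + 1))

print(count_terms(5))   # 7
print(count_terms(20))  # 627
```

The cache guarantees each (k, m) pair is computed once, which is the same saving the s array gives the iterative version.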

Complexity of comparing elements in two lists

I was coding a function in Python to find elements of a sorted list that exist in another sorted list and print out the results:
# assume that both lists are sorted
def compare_sorted_lists(list1, list2):
    res = []
    a = 0
    b = 0
    while a < len(list1) and b < len(list2):
        if list1[a] == list2[b]:
            res.append(list1[a])
            a += 1
        elif list1[a] < list2[b]:
            a += 1
        else:
            b += 1
    return res
I want to figure out the time complexity of comparing elements with this method.
Assuming that:
list1 has length A and the maximum number of digits/letters in a list1 element is X
list2 has length B and the maximum number of digits/letters in a list2 element is Y
For these lists I have O(A+B) time complexity when traversing them with pointers, but how would comparing elements affect the time complexity for this function (specifically, worst-case time complexity)?
Edit: 12 March 2021 16:30 - rephrased question
The comparison between two elements is constant time, so this does not affect the complexity of your whole algorithm, which you correctly identified as O(A+B).
As user1717828 pointed out, the loop runs at most A+B times; however, comparing two elements is not a constant-time operation. If the numbers are fixed-width numbers, then yes, it is; however, Python integers are unbounded. The cost of comparing them grows linearly with the number of digits in them. Therefore the time complexity of the algorithm you gave is
O((A+B) * max{X,Y})
You can actually do better than that under specific circumstances. E.g. if A << B, then the following approach has O(A*log(B)*max{X,Y}) time complexity.
for a in list1:
    split list2 at the middle and keep searching for a in whichever half
    could contain it. Continue until you find a, or the half is empty.
because the inner loop keeps dividing list2 in two, which can last for at most log_2(B)+1 steps.
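A minimal runnable sketch of that binary-search idea, using the standard bisect module. The function name intersect_small_into_large is hypothetical, and both inputs are assumed to be sorted lists as in the question.

```python
from bisect import bisect_left

def intersect_small_into_large(small, large):
    """Elements of sorted `small` that also occur in sorted `large`."""
    res = []
    lo = 0  # `small` is sorted, so each search can start where the last one ended
    for x in small:
        lo = bisect_left(large, x, lo)  # O(log B) comparisons per lookup
        if lo == len(large):
            break  # everything left in `small` is larger than all of `large`
        if large[lo] == x:
            res.append(x)
    return res

print(intersect_small_into_large([3, 8, 21], list(range(0, 30, 3))))  # [3, 21]
```

Each lookup does O(log B) comparisons, and each comparison costs up to max{X,Y}, which is where the bound above comes from.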

Find an element in list that appears at least 60% of the time using Divide and Conquer approach?

Input is an array that has at most one element that appears at least 60% a time. The goal is to find if this array has such an element and if yes, find that element. I came up with a divide and conquer function that finds such an element.
from collections import Counter

def CommonElement(a):
    c = Counter(a)
    return c.most_common(1)  # Returns the element and its frequency

def func(array):
    if len(array) == 1:
        return array[0]
    mid = len(array)//2
    left_element = func(array[:mid])
    right_element = func(array[mid:])
    if left_element == right_element:
        return right_element
    most_common_element = CommonElement(array)
    element_count = most_common_element[0][1]  # Getting the frequency of the element
    percent = element_count/len(array)
    if percent >= .6:
        return most_common_element[0][0]  # Returning the value of the element
    else:
        return None

array = [10,9,10,10,5,10,10,10,12,42,10,10,44,10,23,10]  # Correctly Returns 10
array = [10,9,10,8,5,10,10,10,12,42,10,12,44,10,23,5]  # Correctly Returns None
result = func(array)
print(result)
This function works but it's in O(n log(n)). I want to implement an algorithm that's O(n)
The recurrence relation for my algorithm is T(n) = 2T(n/2) + O(n). I think the goal is to eliminate the need to find the frequency, which takes O(n). Any thoughts?
If you are guaranteed to have a list 60% of which is a given number, that number is guaranteed to be the median. To see this intuitively, sort the list. The number in question represents a contiguous window that is 60% of the length of the list. There is no way to place that window so that it doesn't cover the middle.
There are plenty of divide-and-conquer algorithms for finding the median. A common one is called introselect. You can find an implementation in numpy's partition and argpartition functions (it's written in C). The basic idea is to do quicksort, but only recurse into the portion that contains the index you care about. This reduces the algorithm to O(n).
By the way, you could search for any index between 40% and 60% of the length of the list. 50% seems like a reasonable middle ground.
To verify that the median appears at least 60% of the time, run a single loop over the array, counting the number of times the median appears.
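Here is a rough pure-Python sketch of the select-then-verify idea. Note that it uses plain quickselect with a random pivot rather than introselect, so it is expected O(n) but not worst-case O(n); the names quickselect and find_by_median are assumptions of this example.

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (0-indexed) in expected O(n) time."""
    items = list(items)  # work on a copy so the caller's list is not mutated
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        num_equal = sum(1 for x in items if x == pivot)
        if k < len(lows):
            items = lows                 # answer lies among the smaller elements
        elif k < len(lows) + num_equal:
            return pivot                 # answer is the pivot itself
        else:
            k -= len(lows) + num_equal   # answer lies among the larger elements
            items = [x for x in items if x > pivot]

def find_by_median(array):
    """Return the element filling at least 60% of `array`, or None."""
    if not array:
        return None
    candidate = quickselect(array, len(array) // 2)
    # Integer form of count / len >= 0.6, avoiding float rounding
    if 5 * array.count(candidate) >= 3 * len(array):
        return candidate
    return None

print(find_by_median([10,9,10,10,5,10,10,10,12,42,10,10,44,10,23,10]))  # 10
print(find_by_median([10,9,10,8,5,10,10,10,12,42,10,12,44,10,23,5]))    # None
```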
You can create a frequency counter for all elements in the list one time in O(n). Then, iterate the lookup table and see if any are at least 60% of the elements (in other words, count / len(lst) >= 0.6).
>>> from collections import Counter
>>> L = [4, 2, 3, 2, 4, 4, 4]
>>> Counter(L)
Counter({4: 4, 2: 1, 3: 1})
>>> Counter(L).most_common(1)
[(4, 4)]
>>> item, count = Counter(L).most_common(1)[0]
>>> count / len(L)
0.6666666666666666
>>> count / len(L) >= 0.6
True
Divide & conquer is a creative, but inappropriate, approach for this problem.
...or so I thought, but see this answer.
There's a pretty simple algorithm for finding the majority element of a collection, if the collection has one:
def majority(l):
    count, candidate = 0, None
    for element in l:
        if count == 0:
            count, candidate = 1, element
        elif element == candidate:
            count += 1
        else:
            count -= 1
    return candidate
This algorithm essentially pairs each element of the input against another element with a different value until all unpaired elements have the same value, then returns that value. If the input has a majority element, the algorithm must return that.
You can compute a candidate with this algorithm, then make another pass through the input and see if that candidate is a 60% supermajority. This works in O(1) space and O(n) time without mutating the input, while hash-based or introselect-based algorithms would need more space or mutate the input. It's also immune to hash collision attacks (unlike Counter and other hash-based approaches) and doesn't require elements to have an order relation (unlike introselect).
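Putting the two passes together might look like the sketch below. The function name supermajority and its num/den parameters are made up for illustration; the input is assumed to be a list (it is traversed twice), and the threshold must be above one half for the candidate pass to be reliable.

```python
def supermajority(items, num=3, den=5):
    """Return the element filling at least num/den of `items`, else None."""
    count, candidate = 0, None
    for element in items:              # pass 1: majority-vote candidate
        if count == 0:
            count, candidate = 1, element
        elif element == candidate:
            count += 1
        else:
            count -= 1
    # pass 2: count how often the candidate really occurs
    occurrences = sum(1 for element in items if element == candidate)
    # Integer comparison avoids float rounding right at the 60% boundary
    if items and occurrences * den >= num * len(items):
        return candidate
    return None

print(supermajority([10,9,10,10,5,10,10,10,12,42,10,10,44,10,23,10]))  # 10
print(supermajority([10,9,10,8,5,10,10,10,12,42,10,12,44,10,23,5]))    # None
```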

Elements of Programming Interview 5.15 (Random Subset Computation)

Algorithm problem:
Write a program which takes as input a positive integer n and size
k <= n; return a size-k subset of {0, 1, 2, .. , n -1}. The subset
should be represented as an array. All subsets should be equally
likely, and in addition, all permutations of elements of the array
should be equally likely. You may assume you have a function which
takes as input a nonnegative integer t and returns an integer in the
set {0, 1,...,t-1}.
My original solution to this in pseudocode is as follows:
Set t = n, and output the result of the random number generator into a set() until set() has size(set) == t. Return list(set)
The author solution is as follows:
import random

def online_sampling(n, k):
    changed_elements = {}
    for i in range(k):
        rand_idx = random.randrange(i, n)
        rand_idx_mapped = changed_elements.get(rand_idx, rand_idx)
        i_mapped = changed_elements.get(i, i)
        changed_elements[rand_idx] = i_mapped
        changed_elements[i] = rand_idx_mapped
    return [changed_elements[i] for i in range(k)]
I totally understand the author's solution - my question is more about why my solution is incorrect. My guess is that it becomes greatly inefficient as t approaches n, because in that case, the probability that I need to keep running the random num function until I get a number that isn't in t gets higher and higher. If t == n, for the very last element to add to set there is just a 1/n chance that I get the correct element, and would probabilistically need to run the given rand() function n times just to get the last item.
Is this the correct reason why my solution isn't efficient? Is there anything else I'm missing? And how would one describe the time complexity of my solution then? By the above rationale, I believe it would be O(n^2), since probabilistically I would need to run n + (n - 1) + (n - 2) + ... times.
Your solution is (almost) correct.
Firstly, it will run in expected O(n log n) time instead of O(n^2), assuming that all operations with set are O(1). Here's why.
The expected time to add the first element to the set is 1 = n/n.
The expected time to add the second element to the set is n/(n-1), because the probability to randomly choose yet unchosen element is (n-1)/n. See geometric distribution for an explanation.
...
For the k-th element, the expected time is n/(n-k+1). So for n elements the total time is n/n + n/(n-1) + ... + n/1 = n * (1 + 1/2 + ... + 1/n) = O(n log n), because the harmonic sum 1 + 1/2 + ... + 1/n grows like ln n.
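The sum above can be evaluated exactly with fractions, which makes it easy to check against the n log n estimate. This is just an illustrative sketch; expected_draws is a made-up helper name.

```python
from fractions import Fraction

def expected_draws(n, k):
    """Expected number of rand() calls to collect k distinct values out of n."""
    # With j values already chosen, a draw is new with probability (n - j) / n,
    # so the wait for the next new value is geometric with mean n / (n - j).
    return sum(Fraction(n, n - j) for j in range(k))

print(expected_draws(4, 4))               # 25/3
print(float(expected_draws(1000, 1000)))  # compare with 1000 * ln(1000)
```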
Moreover, we can prove by induction that all chosen subsets will be equiprobable.
However, when you do list(set(...)), it is not guaranteed the resulting list will contain elements in the same order as you put them into a set. For example, if set is implemented as a binary search tree then the list will always be sorted. So you have to store the list of unique found elements separately.
UPD (@JimMischel): we proved the average case running time. There still is a possibility that the algorithm will run indefinitely (for example, if rand() always returns 1).
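A minimal sketch of the asker's approach with the fix described above, i.e. keeping the chosen elements in a separate list so their order is the order in which they were drawn. The name rejection_sample is hypothetical, and random.randrange stands in for the rand function the problem statement provides.

```python
import random

def rejection_sample(n, k, rand=random.randrange):
    """Draw k distinct values from {0, ..., n-1}, in the order first drawn."""
    seen = set()   # O(1) membership test
    order = []     # preserves draw order, unlike list(set(...))
    while len(order) < k:
        x = rand(n)            # rand(t) returns an integer in {0, ..., t-1}
        if x not in seen:      # rejected draws are simply retried
            seen.add(x)
            order.append(x)
    return order

print(rejection_sample(10, 4))
```

As noted, only the expected running time is bounded; the loop has no worst-case guarantee.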
Your method has a big problem. You may get duplicate numbers if your random number generator produces the same number twice, mightn't you?
If you say set() will not keep duplicate numbers, then your method has added the members of the set with different probabilities. So the numbers in your set will not be equally likely.
The problem with your method is not efficiency; it does not create an equally likely result set. The author uses a variation of the Fisher-Yates method to create a subset that is equally likely.

How can I merge two lists and sort them working in 'linear' time?

I have this, and it works:
# E. Given two lists sorted in increasing order, create and return a merged
# list of all the elements in sorted order. You may modify the passed in lists.
# Ideally, the solution should work in "linear" time, making a single
# pass of both lists.
def linear_merge(list1, list2):
    finalList = []
    for item in list1:
        finalList.append(item)
    for item in list2:
        finalList.append(item)
    finalList.sort()
    return finalList
    # +++your code here+++
    return
But, I'd really like to learn this stuff well. :) What does 'linear' time mean?
Linear means O(n) in Big O notation, while your code uses a sort(), which is most likely O(n log n).
The question is asking for the standard merge algorithm. A simple Python implementation would be:
def merge(l, m):
    result = []
    i = j = 0
    total = len(l) + len(m)
    while len(result) != total:
        if len(l) == i:
            result += m[j:]
            break
        elif len(m) == j:
            result += l[i:]
            break
        elif l[i] < m[j]:
            result.append(l[i])
            i += 1
        else:
            result.append(m[j])
            j += 1
    return result
>>> merge([1,2,6,7], [1,3,5,9])
[1, 1, 2, 3, 5, 6, 7, 9]
Linear time means that the time taken is bounded by some undefined constant times (in this context) the number of items in the two lists you want to merge. Your approach doesn't achieve this - it takes O(n log n) time.
When specifying how long an algorithm takes in terms of the problem size, we ignore details like how fast the machine is, which basically means we ignore all the constant terms. We use "asymptotic notation" for that. These basically describe the shape of the curve you would plot in a graph of problem size in x against time taken in y. The logic is that a bad curve (one that gets steeper quickly) will always lead to a slower execution time if the problem is big enough. It may be faster on a very small problem (depending on the constants, which probably depends on the machine) but for small problems the execution time isn't generally a big issue anyway.
The "big O" specifies an upper bound on execution time. There are related notations for average execution time and lower bounds, but "big O" is the one that gets all the attention.
O(1) is constant time - the problem size doesn't matter.
O(log n) is a quite shallow curve - the time increases a bit as the problem gets bigger.
O(n) is linear time - each unit increase means it takes a roughly constant amount of extra time. The graph is (roughly) a straight line.
O(n log n) curves upwards more steeply as the problem gets more complex, but not by very much. This is the best that a general-purpose sorting algorithm can do.
O(n squared) curves upwards a lot more steeply as the problem gets more complex. This is typical for slower sorting algorithms like bubble sort.
The nastiest algorithms are classified as "np-hard" or "np-complete" where the "np" means "non-polynomial" - the curve gets steeper quicker than any polynomial. Exponential time is bad, but some are even worse. These kinds of things are still done, but only for very small problems.
EDIT the last paragraph is wrong, as indicated by the comment. I do have some holes in my algorithm theory, and clearly it's time I checked the things I thought I had figured out. In the mean time, I'm not quite sure how to correct that paragraph, so just be warned.
For your merging problem, consider that your two input lists are already sorted. The smallest item from your output must be the smallest item from one of your inputs. Get the first item from both and compare the two, and put the smallest in your output. Put the largest back where it came from. You have done a constant amount of work and you have handled one item. Repeat until both lists are exhausted.
Some details... First, putting the item back in the list just to pull it back out again is obviously silly, but it makes the explanation easier. Next - one input list will be exhausted before the other, so you need to cope with that (basically just empty out the rest of the other list and add it to the output). Finally - you don't actually have to remove items from the input lists - again, that's just the explanation. You can just step through them.
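The description above (look at the front item of each input, emit the smaller, then drain whichever input is left) can be sketched with plain iterators, so nothing is removed from the inputs. The name merge_iterables and the sentinel object are choices made for this example only.

```python
def merge_iterables(left, right):
    """Merge two sorted iterables into one sorted list in a single pass."""
    done = object()  # sentinel marking an exhausted input
    it_l, it_r = iter(left), iter(right)
    a, b = next(it_l, done), next(it_r, done)
    out = []
    while a is not done and b is not done:
        if b < a:                 # take from the right only when strictly smaller
            out.append(b)
            b = next(it_r, done)
        else:
            out.append(a)
            a = next(it_l, done)
    # One input ran out; copy the pending item and the rest of the other input
    if a is not done:
        out.append(a)
        out.extend(it_l)
    if b is not done:
        out.append(b)
        out.extend(it_r)
    return out

print(merge_iterables([1, 2, 6, 7], [1, 3, 5, 9]))  # [1, 1, 2, 3, 5, 6, 7, 9]
```

Each item is looked at a constant number of times, so the work is proportional to the combined length of the inputs.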
Linear time means that the runtime of the program is proportional to the length of the input. In this case the input consists of two lists. If the lists are twice as long, then the program will run approximately twice as long. Technically, we say that the algorithm should be O(n), where n is the size of the input (in this case the length of the two input lists combined).
This appears to be homework, so I will not supply you with an answer. Even if this is not homework, I am of the opinion that you will be best served by taking a pen and a piece of paper, constructing two smallish sorted example lists, and figuring out how you would merge those two lists by hand. Once you have figured that out, implementing the algorithm is a piece of cake.
(If all goes well, you will notice that you need to iterate over each list only once, in a single direction. That means that the algorithm is indeed linear. Good luck!)
If you build the result in reverse sorted order, you can use pop() and still be O(N)
pop() from the right end of the list does not require shifting the elements, so is O(1)
Reversing the list before we return it is O(N)
>>> def merge(l, r):
...     result = []
...     while l and r:
...         if l[-1] > r[-1]:
...             result.append(l.pop())
...         else:
...             result.append(r.pop())
...     result+=(l+r)[::-1]
...     result.reverse()
...     return result
...
>>> merge([1,2,6,7], [1,3,5,9])
[1, 1, 2, 3, 5, 6, 7, 9]
This thread contains various implementations of a linear-time merge algorithm. Note that for practical purposes, you would use heapq.merge.
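For reference, a minimal usage sketch of heapq.merge from the standard library. It returns a lazy iterator over the merged values, so it is wrapped in list() here to show the result.

```python
import heapq

# heapq.merge assumes each input is already sorted
merged = list(heapq.merge([1, 2, 6, 7], [1, 3, 5, 9]))
print(merged)  # [1, 1, 2, 3, 5, 6, 7, 9]
```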
Linear time means O(n) complexity. You can read about algorithm complexity and big-O notation here: http://en.wikipedia.org/wiki/Big_O_notation .
Rather than combining the lists into finalList and sorting afterwards, try to merge them gradually: add an element, make sure the result is still sorted, then add the next element... this should give you some ideas.
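A simpler version, which requires equal-sized lists. Note that it only orders each pair (L1[i], L2[i]), so the result is fully sorted only when the larger element of each pair does not exceed the smaller element of the next pair; for example, [1, 2] and [3, 4] give [1, 3, 2, 4]:
def merge_sort(L1, L2):
    res = []
    for i in range(len(L1)):
        if L1[i] < L2[i]:
            first = L1[i]
            second = L2[i]
        else:
            first = L2[i]
            second = L1[i]
        res.extend([first, second])
    return res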
The itertoolz module of the toolz library provides an efficient implementation, merge_sorted, for merging sorted sequences:
https://toolz.readthedocs.io/en/latest/_modules/toolz/itertoolz.html#merge_sorted
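'Linear time' means that the running time is an O(n) function, where n is the number of input items (the items in the lists).
f(n) = O(n) means that there exist a constant y and a threshold n0 such that f(n) <= y * n for all n >= n0. If in addition x * n <= f(n) for some constant x > 0, the bound is tight, which is written f(n) = Θ(n).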
def linear_merge(list1, list2):
    finalList = []
    i = 0
    j = 0
    while i < len(list1):
        if j < len(list2):
            if list1[i] < list2[j]:
                finalList.append(list1[i])
                i += 1
            else:
                finalList.append(list2[j])
                j += 1
        else:
            finalList.append(list1[i])
            i += 1
    while j < len(list2):
        finalList.append(list2[j])
        j += 1
    return finalList
