Efficiently Predict the Output of a Number Processing Algorithm - python

I am working on a bit of code that needs to be able to efficiently predict (preferably in O(1) time) the output of the following algorithm when presented with two ints m and n.
def algorithm(m, n):
    history = set()
    while True:
        if (m, n) in history:
            return False
        elif n == m:
            return True
        else:
            history.add((m, n))
            if m > n:
                x = m - n
                y = 2 * n
                m = x
                n = y
            else:
                x = 2 * m
                y = n - m
                m = x
                n = y
Note that when (m, n) reappears in the algorithm's history, you've entered an infinite loop (e.g. 2,1 -> 1,2 -> 2,1 -> ...); when m == n the iteration can proceed only one step further and must terminate (e.g. 5,5 -> 10,0 -> 10,0 -> ...). Essentially, I need to be able to predict whether the current m and n will ever match.
PS, if this algorithm has a name I'd love to know it. Furthermore, if there exists good reading on this topic (predicting numerical sequences, etc...) I'd love to be directed to it.

Assuming positive integer input, this algorithm will return True if and only if (m+n)/gcd(m, n) is a power of two.
Proof sketch:
Divide both m and n by gcd(m, n) at the start of the algorithm; this will not change the return value.
If the sum of m and n is divisible by an odd prime p after doing this, then both m and n need to become divisible by p for the algorithm to return True, but neither m nor n can do so.
If the sum of m and n is a power of two, then both m and n will become divisible by another factor of 2 on each iteration until both are equal.
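For positive integers, a minimal checker based on this criterion might look like the sketch below (the helper name algorithm_terminates is mine; this is not quite O(1), since the gcd costs O(log min(m, n)), but it is far cheaper than simulating the loop):
from math import gcd

def algorithm_terminates(m, n):
    # True iff (m + n) / gcd(m, n) is a power of two.
    s = (m + n) // gcd(m, n)
    return (s & (s - 1)) == 0
For example, algorithm_terminates(2, 1) is False (the 2,1 -> 1,2 cycle from the question), while algorithm_terminates(1, 3) is True (1,3 -> 2,2).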

First of all, let's reduce the update step to a single line. On each iteration, m updates to the absolute difference; n updates to twice the smaller number.
    else:
        history.add((m, n))
        m, n = abs(m - n), 2 * min(m, n)
This highlights the non-linearity of the iteration. Each update splits into the two cases you originally programmed, and the recurrence keeps branching into further cases on every subsequent iteration.
I believe that the short answer for this is no -- you cannot predict the outcome in a time reasonably shorter than simply executing the algorithm.
The division point between the two branches is when one number is 3 times the other. While you're outside that range, the algorithm closes the gap simply: subtract the smaller from the larger, then double the smaller. Once the pair gets within the 3x range, the system quickly turns chaotic: you cannot state that two nearby pairs will have results that remain nearby as the algorithm progresses, not for any adjacent pairs.
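Failing a closed form, here is a compact rewrite of the whole loop using the one-line update above (same behaviour as the original; note that m + n is invariant under the update, so the state space is finite and the history check must eventually fire):
def simulate(m, n):
    seen = set()
    while m != n:
        if (m, n) in seen:
            return False      # revisited a state: infinite loop
        seen.add((m, n))
        m, n = abs(m - n), 2 * min(m, n)
    return True               # m == n: the algorithm terminates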

Related

Big O Complexity Python, find out running time for input of size 'n'

For this function, I am to select the most appropriate Big-O running time for input of size n:
def two_d_list(n):
    i = 0
    data = []
    while i < n:
        data.append([i] * n)
        i += 1
    return data
The above function takes an integer as a parameter and creates a 2d list of integers. For example, the following code fragment:
print(two_d_list(3))
produces
[[0, 0, 0], [1, 1, 1], [2, 2, 2]]
Select one:
a. O(n^3)
b. O(n)
c. O(n log n)
d. O(n^2)
e. O(log n)
I think the answer should be d. O(n^2). Is this right?
The number of times the while loop runs is linear in n, and in each of those iterations, [i] * n creates a list containing i repeated n times, which is another loop over n items. One loop nested inside another gives O(n^2) time complexity.
None of the given options is correct. Let's rewrite the function ever so slightly: we're only going to rename the parameter from n to x:
def two_d_list(x):
    i = 0
    data = []
    while i < x:
        data.append([i] * x)
        i += 1
    return data
Now how many operations do we have? There are still O(x^2) operations: For each of the x values of i, we have to create a list with x elements.
But what's the input size n? That's the number of bits you need to represent the number x. The value of x grows much faster than n: you can double x while adding only one bit to the input.
As a result, you have n = log x, or x = 2**n. Since the work is O(x^2) = O((2**n)^2), the time complexity of your function is actually O(4**n).
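A toy way to see the growth (a sketch; cells_written just counts the list cells the function fills, which is the dominant cost here):
def cells_written(x):
    return x * x  # the function builds x lists of x cells each

for bits in range(1, 6):
    x = 2 ** bits                     # one extra input bit roughly doubles x...
    print(bits, cells_written(x))     # ...and roughly quadruples the work: O(4**n)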
The question is misleading. The input size is log n i.e. the number of bits required to represent n.
If we want to compute the complexity in terms of n instead of input size, then the answer is O(n^2).
If we want to compute it in terms of input size s, then the answer is O((2^s)^2) = O(2^(2*s)) = O((2^2)^s) = O(4^s), as chepner wrote.

Fastest way to sample most numbers with minimum difference larger than a value from a Python list

Given a list of 20 float numbers, I want to find a largest subset in which any two of the chosen numbers differ from each other by more than mindiff = 1.0. Right now I am using a brute-force method that searches from the largest to smaller subset sizes using itertools.combinations. As shown below, the code finds a subset after 4 s for a list of 20 numbers.
from itertools import combinations
import random
from time import time

mindiff = 1.
length = 20
random.seed(99)
lst = [random.uniform(1., 10.) for _ in range(length)]
t0 = time()
n = len(lst)
sample = []
found = False
while not found:
    # get all subsets with size n
    subsets = list(combinations(lst, n))
    # shuffle to ensure randomness
    random.shuffle(subsets)
    for subset in subsets:
        # sort the subset numbers
        ss = sorted(subset)
        # calculate the differences between every two adjacent numbers
        diffs = [j - i for i, j in zip(ss[:-1], ss[1:])]
        if min(diffs) > mindiff:
            sample = set(subset)
            found = True
            break
    # check subsets with size n - 1
    n -= 1
print(sample)
print(time() - t0)
Output:
{2.3704888087015568, 4.365818049020534, 5.403474619948962, 6.518944556233767, 7.8388969285727015, 9.117993839791751}
4.182451486587524
However, in reality I have a list of 200 numbers, which is infeasible for a brute-force enumeration. I want a fast algorithm to sample just one random largest subset with a minimum difference larger than 1. Note that I want each sample to be random and of maximum size. Any suggestions?
My previous answer assumed you simply wanted a single optimal solution, not a uniform random sample of all solutions. This answer samples uniformly from all such optimal solutions.
1. Construct a directed acyclic graph G where there is one node for each point, and nodes a and b are connected when b - a > mindist. Also add two virtual nodes, s and t, where s -> x for all x and x -> t for all x.
2. Calculate for each node in G how many paths of length k exist to t. You can do this efficiently in O(n^2 k) time using dynamic programming with a table P[x][k], filling initially P[x][0] = 0 except P[t][0] = 1, and then P[x][k] = sum(P[y][k-1] for y in neighbors(x)). Keep doing this until you reach the maximum k - you now know the size of the optimal subset.
3. Uniformly sample a path of length k from s to t, using P to weight your choices. Start at s: look at each neighbor of s and choose one randomly, weighting neighbor y by P[y][k-1]. This gives the first element of the optimal set. Then repeat the step: on step i, when we are at x, look at the neighbors y of x and pick one randomly using weights P[y][k-i].
4. Use the nodes you sampled in step 3 as your random subset.
An implementation of the above in pure Python:
import random

def sample_mindist_subset(xs, mindist):
    # Construct directed graph G.
    n = len(xs)
    s = n; t = n + 1  # Two virtual nodes, source and sink.
    neighbors = {
        i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
        for i in range(n)}
    neighbors[s] = [t] + list(range(n))
    neighbors[t] = []
    # Compute number of paths P[x][k] from x to t of length k.
    P = [[0 for _ in range(n+2)] for _ in range(n+2)]
    P[t][0] = 1
    for k in range(1, n+2):
        for x in range(n+2):
            P[x][k] = sum(P[y][k-1] for y in neighbors[x])
    # Sample maximum length path uniformly at random.
    maxk = max(k for k in range(n+2) if P[s][k] > 0)
    path = [s]
    while path[-1] != t:
        candidates = neighbors[path[-1]]
        weights = [P[cn][maxk-len(path)] for cn in candidates]
        path.append(random.choices(candidates, weights)[0])
    return [xs[i] for i in path[1:-1]]
Note that if you want to sample from the same set of numbers many times, you don't have to recompute P every single time and can re-use it.
I probably don't fully understand the question, because right now the solution is quite trivial. EDIT: yes, I misunderstood after all, the OP does not just want an optimal solution, but wishes to randomly sample from the set of optimal solutions. This answer is not incorrect but it also is an answer to a different question than what OP is interested in.
Simply sort the numbers and greedily construct the subset:
def mindist_subset(xs, mindist):
    result = []
    for x in sorted(xs):
        if not result or x - result[-1] > mindist:
            result.append(x)
    return result
Sketch of proof of correctness.
Suppose we have a solution S of optimal size for input array A. If it does not contain min(A), note that we could remove min(S) from S and add min(A) instead, since this would only increase the distance between the smallest and second smallest numbers in S. Conclusion: we can, without loss of generality, assume that min(A) is part of an optimal solution.
Now we can apply this argument recursively. We add min(A) to a solution and remove all elements too close to min(A), giving remaining elements A'. Then we're left with a subproblem where exactly the same argument applies, we can choose min(A') as our next element of the solution, etc.

How to understand leetcode 494 Target Sum ( knapsack problem ) fastest python code using bit operation

The problem is described as follows
https://leetcode.com/problems/target-sum/
You are given a list of non-negative integers, a1, a2, ..., an, and a target, S. Now you have 2 symbols + and -. For each integer, you should choose one from + and - as its new symbol.
Find out how many ways to assign symbols to make sum of integers equal to target S.
Constraints:
The length of the given array is positive and will not exceed 20.
The sum of elements in the given array will not exceed 1000.
Your output answer is guaranteed to be fitted in a 32-bit integer.
I found this submission in the LeetCode submission detail (Accepted Solutions Runtime Distribution):
from functools import reduce

class Solution:
    def findTargetSumWays(self, nums, S):
        a = sum(nums) - S
        if a < 0 or a % 2 == 1: return 0
        S = [((1 << (i*21)) + 1) for i in nums]
        return reduce(lambda p, i: (p*i) & (1 << ((a//2+1)*21)) - 1, S, 1) >> (21*a//2)
Replacing the reduce with an explicit loop, it becomes:
class Solution:
    def findTargetSumWays(self, nums, S):
        a = sum(nums) - S
        if a < 0 or a % 2 == 1: return 0
        auxarr = [((1 << (i*21)) + 1) for i in nums]
        ret = 1
        for i in auxarr:
            ret = (ret*i) & (1 << ((a//2+1)*21)) - 1
        return ret >> (21*a//2)
It transforms the original problem into an equivalent one: count the ways to select some of the nums[i] so that their sum is (sum(nums) - S) / 2.
I know how to solve such knapsack problems with dp, but I can't understand the above code, I am very curious how such code works, please help me.
# my dp code
from typing import List

class Solution:
    def findTargetSumWays(self, nums: List[int], S: int) -> int:
        S = sum(nums) - S
        if S % 2 != 0 or S < 0: return 0
        S //= 2
        dp = [0] * (S + 1)
        dp[0] = 1
        for c in nums:
            for j in range(S, c - 1, -1):
                dp[j] += dp[j - c]
        return dp[S]
It seems to use characteristics of a polynomial where you multiply terms formed of (B^n+1) where B is a power of 2 large enough to avoid overlapping.
So, let's say you have 3 numbers (x,y,z) to add, it will compute:
(B^x + 1)(B^y + 1)(B^z + 1)
The exponents of these polynomials will add up in the result
B^(x+y+z) + B^(x+z) + B^(y+z) + B^z + B^(x+y) + B^x + B^y + 1
So, if any combination of exponents (i.e. numbers) adds up to the same total, the number of times B^total occurs will be the number of ways to obtain that total. Leveraging this characteristic of polynomials, we will find ways*B^total in the result. As long as the number of ways does not spill over into B^(total+1), it can be extracted using masks and integer divisions.
For example, given 4 numbers h, i, j, k, the product will produce the sum of B raised to powers corresponding to every combination of 1 up to 4 of the numbers added together. So, if we are looking for a total T and both h+i and j+k equal T, then the product will contain 2*B^T, formed by B^(h+i) + B^(j+k). This corresponds to two ways to form the sum T.
Given that there are 2 possibilities for each number (+ or -), there is a maximum of 2^20 possible ways to combine them. To make sure that the count attached to any B^x can never carry over into B^(x+1), the value 2^21 is chosen for B.
This is why the auxiliary array (the variable name S is a really poor choice here) is formed of (B^n + 1) for each n in nums, where B is 2^21: (2^21)^n + 1 = 2^(21*n) + 1 = (1 << (21*n)) + 1.
To be able to use the polynomial approach, the problem needs to be converted to an Absence/Presence problem. This is done by reasoning that there has to be a combination of numbers that produces a zero sum by cancelling each other out, leaving the rest to be positive and add up to S. So, if we remove S from the total of numbers, there will be a combination that adds up to half of what remains (a//2). This will be the total we will be looking for.
The reduce function implements the polynomial product and applies a mask ((1<<((a//2+1)*21))-1) to cut off any power of B that is beyond B^(a/2). The final result cuts off the part below B^(a/2) by shifting bits.
This leaves the coefficient of B^(a/2), which is exactly the number of ways to produce that sum of exponents (i.e. the sum of numbers).
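To see the bookkeeping in isolation, here is a self-contained sketch of the same trick for the plain subset-count problem (count_subsets_bitpack and its width parameter are mine, not part of the original submission; it assumes each count fits in width bits, so adjacent coefficients never overlap):
from functools import reduce

def count_subsets_bitpack(nums, target, width=21):
    # Each number x contributes a factor (B**x + 1), with B = 2**width.
    factors = [(1 << (x * width)) + 1 for x in nums]
    # Mask away every coefficient above B**target to keep the integers small.
    mask = (1 << ((target + 1) * width)) - 1
    product = reduce(lambda acc, f: (acc * f) & mask, factors, 1)
    # The coefficient of B**target is the number of subsets summing to target.
    return product >> (target * width)

print(count_subsets_bitpack([1, 1, 2, 3], 3))  # 3: {1a, 2}, {1b, 2}, {3}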

Elements of Programming Interview 5.15 (Random Subset Computation)

Algorithm problem:
Write a program which takes as input a positive integer n and size
k <= n; return a size-k subset of {0, 1, 2, .. , n -1}. The subset
should be represented as an array. All subsets should be equally
likely, and in addition, all permutations of elements of the array
should be equally likely. You may assume you have a function which
takes as input a nonnegative integer t and returns an integer in the
set {0, 1,...,t-1}.
My original solution to this in pseudocode is as follows:
Set t = n, and keep feeding the output of the random number generator into a set() until the set has size t. Return list(set).
The author solution is as follows:
import random

def online_sampling(n, k):
    changed_elements = {}
    for i in range(k):
        rand_idx = random.randrange(i, n)
        rand_idx_mapped = changed_elements.get(rand_idx, rand_idx)
        i_mapped = changed_elements.get(i, i)
        changed_elements[rand_idx] = i_mapped
        changed_elements[i] = rand_idx_mapped
    return [changed_elements[i] for i in range(k)]
I totally understand the author's solution - my question is more about why my solution is incorrect. My guess is that it becomes greatly inefficient as t approaches n, because in that case the probability that I need to keep re-running the random number function until I get a number that isn't already in the set gets higher and higher. If t == n, for the very last element added to the set there is just a 1/n chance that I get the correct element, and I would probabilistically need to run the given rand() function n times just to get the last item.
Is this the correct reason why my solution isn't efficient? Is there anything else I'm missing? And how would one describe the time complexity of my solution then? By the above rationale, I believe it would be O(n^2), since probabilistically it needs to run roughly n + (n-1) + (n-2) + ... times.
Your solution is (almost) correct.
Firstly, it will run in O(n log n) instead of O(n^2), assuming that all operations with set are O(1). Here's why.
The expected time to add the first element to the set is 1 = n/n.
The expected time to add the second element to the set is n/(n-1), because the probability of randomly choosing a not-yet-chosen element is (n-1)/n. See the geometric distribution for an explanation.
...
For the k-th element, the expected time is n/(n-k+1). So for n elements the total time is n/n + n/(n-1) + ... + n/1 = n * (1 + 1/2 + ... + 1/n) = O(n log n).
Moreover, we can prove by induction that all chosen subsets will be equiprobable.
However, when you do list(set(...)), it is not guaranteed the resulting list will contain elements in the same order as you put them into a set. For example, if set is implemented as a binary search tree then the list will always be sorted. So you have to store the list of unique found elements separately.
UPD (thanks to @JimMischel): we proved the average-case running time. There still is a possibility that the algorithm will run indefinitely (for example, if rand() always returns 1).
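For what it's worth, a minimal sketch of the corrected version of your approach (keeping a separate list so insertion order, and hence the permutation, stays uniform; random.randrange stands in for the given rand() function):
import random

def random_subset(n, k):
    seen = set()       # O(1) membership tests
    result = []        # preserves the order in which elements were drawn
    while len(result) < k:
        v = random.randrange(n)    # stand-in for the given rand() over {0, ..., n-1}
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result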
Your method has a big problem: you may return duplicate numbers if your random number generator produces the same number twice, right?
If you argue that set() will not keep duplicate numbers, then your method creates the members of the set with different chances, so the numbers in your set will not be equally likely.
The problem with your method is not efficiency; it is that it does not create an equally likely result set. The author uses a variation of the Fisher-Yates method to create a subset that is equally likely.

Finding numbers from a to b not divisible by x to y

This is a problem I've been pondering for quite some time.
What is the fastest way to find all numbers from a to b that are not divisible by any number from x to y?
Consider this:
I want to find all the numbers from 1 to 10 that are not divisible by any number from 2 to 5.
This process will become extremely slow if I use a linear approach, like this:
result = []
a = 1
b = 10
x = 2
y = 5
for i in range(a, b):
    t = False
    for j in range(x, y):
        if i % j == 0:
            t = True
            break
    if t is False:
        result.append(i)
print(result)
Does anybody know of any other methods of doing this with less computation time than a linear solution?
If not, can anyone see how this might be done faster, as I am blank at this point...
Sincerely,
John
[EDIT]
The numbers can range from 0 up to more than 1e+100.
This is true for a, b, x and y.
You only need to check prime values in the range of the possible divisors - for example, if a value is not divisible by 2, it won't be divisible by any multiple of 2 either; likewise for every other prime and prime multiple. Thus in your example you can check 2, 3, 5 - you don't need to check 4, because anything divisible by 4 must be divisible by 2. Hence, a faster approach would be to compute primes in whatever range you are interested in, and then simply calculate which values they divide.
Another speedup is to add each value in the range you are interested in to a set: when you find that it is divisible by a number in your range, remove it from the set. You then should only be testing numbers that remain in the set - this will stop you testing numbers multiple times.
If we combine these two approaches, we see that we can create a set of all values (so in the example, a set with all values 1 to 10), and simply remove the multiples of each prime in your second range from that set.
Edit: As Patashu pointed out, this won't quite work if the prime that divides a given value is not in the set. To fix this, we can apply a similar algorithm to the above: create a set with values [a, b], and for each value in the set, remove all of its multiples. So for the example given below in the comments (with [3, 6]) we'd start with 3 and remove its multiples in the set - so 6. Hence the remaining values we need to test would be [3, 4, 5], which is what we want in this case.
Edit2: Here's a really hacked up, crappy implementation that hasn't been optimized and has horrible variable names:
def find_non_factors():
    a = 1
    b = 1000000
    x = 200
    y = 1000
    z = [True for p in range(x, y+1)]
    for k, i in enumerate(z):
        if i:
            k += x
            n = 2
            while n * k < y + 1:
                z[(n*k) - x] = False
                n += 1
    k = {p for p in range(a, b+1)}
    for p, v in enumerate(z):
        if v:
            t = p + x
            n = 1
            while n * t < (b + 1):
                if (n * t) in k:
                    k.remove(n * t)
                n += 1
    return k
Try your original implementation with those numbers. It takes > 1 minute on my computer. This implementation takes under 2 seconds.
Ultimate optimization caveat: do not prematurely optimize. Any time you attempt to optimize code, profile it to ensure it needs optimization, and profile the optimization on the same kind of data you intend it to be optimized for, to confirm it is a speedup. Almost all code does not need optimization, just to give the correct answer.
If you are optimizing for small x-y and large a-b:
Create an array with length that is the lowest common multiple out of all the x, x+1, x+2... y. For example, for 2, 3, 4, 5 it would be 60, not 120.
Now populate this array with booleans - false initially for every cell, then for each number in x-y, populate all entries in the array that are multiples of that number with true.
Now for each number in a-b, index into the array modulo the array length: if the entry is true, skip that number; if it is false, output it.
You can do this a little quicker by removing, from your x to y factors, numbers whose prime factorizations are strict supersets of other numbers' prime factorizations. By which I mean - if you have 2, 3, 4, 5, then 4 is 2*2, a strict superset of 2, so you can remove it, and now our array length is only 30. For something like 3, 4, 5, 6 however, 4 is 2*2 and 6 is 3*2 - 6 is a superset of 3 so we remove it, but 4 is not a superset of anything else so we keep it. The LCM is 3*2*2*5 = 60. Doing this kind of thing would give some speedup on its own for large a-b, and you might not need to go the array direction if that's all you need.
Also, keep in mind that if you aren't going to use the entire result of the function every single time - like, maybe sometimes you're only interested in the lowest value - write it as a generator rather than as a function. That way you can call it until you have enough numbers and then stop, saving time.
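Putting the pieces above together, a rough sketch of the periodic-array idea (the helper name not_divisible is mine; it is written as a generator as suggested, uses the plain lcm without the superset pruning, and is only practical when the lcm of x..y stays small):
from math import gcd
from functools import reduce

def not_divisible(a, b, x, y):
    divisors = range(x, y + 1)
    period = reduce(lambda l, d: l * d // gcd(l, d), divisors, 1)  # lcm of x..y
    divisible = [False] * period
    for d in divisors:
        for multiple in range(0, period, d):
            divisible[multiple] = True       # mark every residue divisible by d
    for num in range(a, b + 1):
        if not divisible[num % period]:
            yield num

print(list(not_divisible(1, 10, 2, 5)))  # [1, 7]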
