How can I choose random indices based on probability? - python

I have a list of numbers and I'm trying to write a function that will choose n random indices i, such that each i's likelihood is percentages[i].
Function:
from random import random

def choose_randomly(probabilities, n):
    percentages = accumulated_s(probabilities)
    result = []
    for i in range(n):
        r = random()
        for j in range(n):
            if r < percentages[j]:
                result = result + [j]
    return result
accumulated_s just generates the corresponding list of accumulated (cumulative) probabilities.
I'm expecting results like this:
choose_randomly([1, 2, 3, 4], 2) -> [3, 3, 0]
choose_randomly([1, 2, 3, 4], 2) -> [1, 3, 1]
The problem is that this is not returning n indices. Can anyone point out what I'm doing wrong?
Thank you so much!

Once you've found the right range of probabilities, you're done; break out of the inner loop to generate the next value, or you'll act as if all probabilities above the correct threshold were matched as well:
# Enumerate all percentages, not just the first n
for j, pct in enumerate(percentages):
    if r < pct:
        result.append(j)  # Don't create tons of temporary lists; mutate in place
        break  # <-- Don't add more results
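Putting it together, a complete corrected version might look like this (a sketch, assuming accumulated_s returns cumulative probabilities ending at 1.0):

from random import random

def choose_randomly(probabilities, n):
    percentages = accumulated_s(probabilities)  # cumulative, e.g. [0.1, 0.3, 0.6, 1.0]
    result = []
    for _ in range(n):
        r = random()
        for j, pct in enumerate(percentages):
            if r < pct:
                result.append(j)
                break
    return result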
Also note, if you have a lot of values in the set of probabilities, it may make sense to use functions from the bisect module to find the correct value, rather than scanning linearly each time; for a small number of entries in percentages, linear scanning is fine, but for a large number, O(log n) lookups may beat O(n) scans.
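For example, here is a minimal sketch of the bisect approach (building the cumulative list inline rather than via accumulated_s, and scaling by the total so unnormalized weights also work):

import bisect
from random import random

def choose_randomly(probabilities, n):
    # Build the cumulative distribution, e.g. [1, 3, 6, 10] for weights [1, 2, 3, 4]
    cumulative = []
    total = 0
    for p in probabilities:
        total += p
        cumulative.append(total)
    # bisect_right finds the first cumulative value exceeding r in O(log n)
    return [bisect.bisect_right(cumulative, random() * total) for _ in range(n)]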

Related

Find missing elements in a list created from a sequence of consecutive integers with duplicates in O(n)

This is a Find All Numbers Disappeared in an Array problem from LeetCode:
Given an array of integers where 1 ≤ a[i] ≤ n (n = size of array),
some elements appear twice and others appear once.
Find all the elements of [1, n] inclusive that do not appear in this array.
Could you do it without extra space and in O(n) runtime? You may
assume the returned list does not count as extra space.
Example:
Input:
[4,3,2,7,8,2,3,1]
Output:
[5,6]
My code is below - I think it's O(n), but the interviewer disagrees:
def findDisappearedNumbers(self, nums: List[int]) -> List[int]:
    results_list = []
    for i in range(1, len(nums) + 1):
        if i not in nums:  # note: `in` on a list is an O(n) scan, so the whole loop is O(n^2)
            results_list.append(i)
    return results_list
You can implement an algorithm where you loop through each element of the list and set the element at index i to a negative integer if the list contains the value i. You can then add each index i whose value is still positive to your list of missing items. It doesn't take any additional space and uses at most three for loops (not nested), which makes the complexity O(3n), i.e. O(n). This site explains it much better and also provides the source code.
Edit: I have added the code in case someone wants it:
# The input list and the output list
input = [4, 5, 3, 3, 1, 7, 10, 4, 5, 3]
missing_elements = []

# Loop through each element i and set input[i - 1] to -input[i - 1]. abs() is
# necessary because earlier iterations may already have negated the value at i.
for i in input:
    if input[abs(i) - 1] > 0:
        input[abs(i) - 1] = -input[abs(i) - 1]

# Loop through the list again and append i + 1 for each index i whose value
# is still positive
for i in range(0, len(input)):
    if input[i] > 0:
        missing_elements.append(i + 1)

print(missing_elements)  # [2, 6, 8, 9]
For me, explicit loops are not the best way to do it, because they add complexity to the given problem. You can try doing it with sets.
def findMissingNums(input_arr):
    max_num = max(input_arr)  # get the max number from the input list/array
    input_set = set(input_arr)  # convert the input array into a set
    set_num = set(range(1, max_num + 1))  # set of all numbers from 1 to n (n = the max of the input array)
    missing_nums = list(set_num - input_set)  # difference of the two sets, converted back to a list
    return missing_nums

input_arr = [4, 3, 2, 7, 8, 2, 3, 1]  # 1 <= input_arr[i] <= n
print(findMissingNums(input_arr))  # outputs [5, 6]
Use a hash table, i.e. a dictionary in Python:
def findDisappearedNumbers(self, nums):
    hash_table = {}
    for i in range(1, len(nums) + 1):
        hash_table[i] = False
    for num in nums:
        hash_table[num] = True
    for i in range(1, len(nums) + 1):
        if not hash_table[i]:
            print("missing..", i)
Try the following:
a = [4, 3, 2, 7, 8, 2, 3, 1]  # in Python 3, input() returns a string, so use a literal list here
b = [x for x in range(1, len(a) + 1)]
c, d = set(a), set(b)
print(list(d - c))

Python - Calculating total possibilities using recursion

I am trying to find the total number of possibilities for placing 90 apples in 90 boxes. Any number of apples can be placed in one box (0 to 90 apples), but all apples have to be placed into boxes. I used recursion, but it took way too much time to complete the calculation. I was only able to test my code with small numbers of apples and boxes. Could anyone help me reduce the time complexity of my code? Thanks in advance.
import math

boxes = 3
apples = 3

def possibilities(apples, boxes):
    if apples == 0:
        return 1
    if boxes == 0:
        return 0
    start_point = 0 if boxes > 1 else math.floor(apples / boxes)
    p = 0
    for n in range(start_point, apples + 1):
        p += possibilities(apples - n, boxes - 1)
    return p

t = possibilities(apples, boxes)
print(t)
The way I see it, the problem consists in finding the number of sorted lists of at most 90 elements which have a sum equal to 90.
There is a concept which is quite close to this and we call it the partitions of a number.
For example, the partitions of 4 are [4], [3, 1], [2, 2], [2, 1, 1], [1, 1, 1, 1].
After a bit of research I found this article which is relevant to your problem.
As explained in there, the recursion method results in a very long calculation for large numbers, but...
A much more efficient approach is via an approach called dynamic programming. Here we compute a function psum(n,k), which is the total number of n-partitions with largest component of k or smaller. At any given stage we will have computed the values of psum(1,k), psum(2,k), psum(3,k), ..., psum(n,k) for some fixed k. Given this vector of n values we compute the values for k+1 as follows:
psum(i,k+1) = psum(i,k) + p(i,k) for any value i
But recall that p(i,k) = Σ_j p(i-k,j) = psum(i-k,k)
So psum(i,k+1) = psum(i,k) + psum(i-k,k)
So with a little care we can reuse the vector of values and compute the values of psum(i,k) in a rolling fashion for successively greater values of k. Finally, we have a vector whose values are psum(i,n). The value psum(n,n) is the desired value p(n). As an additional benefit we see that we have simultaneously computed the values of p(1), p(2), ..., p(n).
Basically, if you keep the intermediate values in a list and use the recurrence presented in the article,
psum(i,k+1) = psum(i,k) + psum(i-k,k)
you can use the following function:
def partitionp(n):
    partpsum = [1] * (n + 1)
    for i in range(2, n + 1):
        for j in range(i, n + 1):
            partpsum[j] += partpsum[j - i]
    return partpsum[n]
At each iteration of the outer for loop, the list partpsum contains all the values psum(1,k), psum(2,k), psum(3,k), ..., psum(n,k). At the end of the iterations, you only need to return psum(n,n).
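As a quick sanity check (my own example, not from the original answer), the function reproduces the five partitions of 4 listed above, and handles 90 essentially instantly:

print(partitionp(4))   # 5: [4], [3, 1], [2, 2], [2, 1, 1], [1, 1, 1, 1]
print(partitionp(90))  # 56634173, computed in a fraction of a second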

Sample Online Data Algorithm Analysis [closed]

Closed. This question needs to be more focused. It is not currently accepting answers. Closed 2 years ago.
I am going through a book called "Elements of Programming Interviews" and have gotten stuck at the following problem:
Implement an algorithm that takes as input an array of distinct
elements and a size, and returns a subset of the given size of the
array elements. All subsets should be equally likely. Return the
result in input array itself.
The solution they provide below is:
import random

def random_sampling(k, A):
    for i in range(k):
        # Generate a random index in [i, len(A) - 1].
        r = random.randint(i, len(A) - 1)
        A[i], A[r] = A[r], A[i]

A = [3, 7, 5, 11]
k = 3
print(random_sampling(k, A))
I do not understand what the authors are trying to do intuitively. Their explanation is below:
Another approach is to enumerate all subsets of size k and then select
one at random from these. Since there are (n choose k) subsets of size k,
the time and space complexity are huge. The key to efficiently
building a random subset of size exactly k is to first build one of
size k - 1 and then adding one more element, selected randomly from
the rest. The problem is trivial when k = 1. We make one call to the
random number generator, take the returned value mod n (call it r),
and swap A[0] with A[r]. The entry A[0] now holds the result.
For k > 1, we begin by choosing one element at random as above and we
now repeat the same process with n - 1 element sub-array A[1, n -1].
Eventually, the random subset occupies the slots A[0, k - 1] and the
remaining elements are in the last n - k slots.
Intuitively, if all subsets of size k are equally likely, then the
construction process ensures that the subset of size k + 1 are also
equally likely. A formal proof for this uses mathematical induction -
the induction hypothesis is that every permutation of every size k
subset of A is equally likely to be in A[0, k -1].
As a concrete example, let the input be A = <3, 7, 5, 11> and the size
be 3. In the first iteration, we use the random number generator to
pick a random integer in the interval [0,3]. Let the returned random
number be 2. We swap A[0] with A[2] - now the array is <5, 7, 3, 11>.
Now we pick a random integer in the interval [1, 3]. Let the returned
random number be 3. We swap A[1] with A[3] - now the resulting array
is <5, 11, 3, 7>. Now we pick a random integer in the interval [2,3].
Let the returned random number be 2. When we swap A[2] with itself the
resulting array is unchanged. The random subset consists of the first
three entries, i.e., {5, 11, 3}.
Sorry for the long text; my questions are these:
What is the key to efficiency they are referring to? It's not clicking in my head.
What did they mean by "eventually, the random subset occupies the slots A[0, k-1] and the remaining elements are in the last n - k slots"
is there a clear reason why "every permutation of every size k subset of A is equally likely to be in A[0, k - 1]"?
Can you explain the theory behind the algorithm in clearer terms?
What is the return of the algorithm supposed to be?
thanks
an intuitive solution might be:
import random

def random_sampling(k, A):
    subset = []
    selected = set()
    for i in range(k):
        index = random.randint(0, len(A) - 1)
        while index in selected:
            index = random.randint(0, len(A) - 1)
        selected.add(index)
        subset.append(A[index])  # append the element itself, not a one-element list
    return subset
but it's not clear that every k-subset has equal probability (because for the same k you may use a different number of random draws over different ranges),
so a solution that fits the probability condition would be:
import itertools as it
import random

def random_sampling(k, A):
    # enumerate every k-element combination of indices -- a very expensive step
    index_possibilities = list(it.combinations(range(len(A)), k))
    index = index_possibilities[random.randint(0, len(index_possibilities) - 1)]
    selected = []
    for i in index:
        selected.append(A[i])
    return selected
so the solution they gave makes sure you use the same procedure of random draws for every set of k elements, without the brute force above
the order of the list is now: the first k elements are the ones we selected, and the rest of the list holds the remaining items
this is the induction assumption: I assume that every subset of length k - 1 has the same probability, and prove it for subsets of length k
an efficient way to guarantee the same probability for every k-sized subset is to perform exactly the same steps to produce it
there is no return value, because the list changed inside the function is also changed in the caller; the subset is the first k elements of the list after the function has been called
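As a small illustration of that last point (my own sketch, not from the book): read the sample out of the first k slots after the call, and a quick frequency count suggests the subsets really are equally likely:

import random
from collections import Counter

def random_sampling(k, A):
    for i in range(k):
        r = random.randint(i, len(A) - 1)
        A[i], A[r] = A[r], A[i]

A = [3, 7, 5, 11]
random_sampling(3, A)
print(A[:3])  # the random subset, e.g. [5, 11, 3]

# Empirical check: each of the four possible 3-element subsets of
# {3, 7, 5, 11} should show up roughly 25% of the time.
counts = Counter()
for _ in range(100000):
    A = [3, 7, 5, 11]
    random_sampling(3, A)
    counts[frozenset(A[:3])] += 1
print(counts)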

Random contiguous slice of list in Python based on a single random integer

Using a single random number and a list, how would you return a random slice of that list?
For example, given the list [0,1,2] there are seven possibilities of random contiguous slices:
[ ]
[ 0 ]
[ 0, 1 ]
[ 0, 1, 2 ]
[ 1 ]
[ 1, 2]
[ 2 ]
Rather than getting a random starting index and a random end index, there must be a way to generate a single random number and use that one value to figure out both starting index and end/length.
I need it that way, to ensure these 7 possibilities have equal probability.
Simply fix one order in which you would sort all possible slices, then work out a way to turn an index in that list of all slices back into the slice endpoints. For example, the order you used could be described by
The empty slice is before all other slices
Non-empty slices are ordered by their starting point
Slices with the same starting point are ordered by their endpoint
So the index 0 should return the empty list. Indices 1 through n should return [0:1] through [0:n]. Indices n+1 through n+(n-1)=2n-1 would be [1:2] through [1:n]; 2n through n+(n-1)+(n-2)=3n-3 would be [2:3] through [2:n], and so on. You see a pattern here: the last index for a given starting point is of the form n+(n-1)+(n-2)+(n-3)+…+(n-k), where k is the starting index of the sequence.

That's an arithmetic series, so that sum is (k+1)(2n-k)/2 = (2n+(2n-1)k-k²)/2. If you set that term equal to a given index and solve it for k, you get a formula involving square roots. You can then use the ceiling function to turn that into an integral value for k corresponding to the last index for that starting point. And once you know k, computing the end point is rather easy.
But the quadratic equation in the solution above makes things really ugly. So you might be better off using some other order. Right now I can't think of a way which would avoid such a quadratic term. The order Douglas used in his answer doesn't avoid square roots, but at least his square root is a bit simpler due to the fact that he sorts by end point first. The order in your question and my answer is called lexicographical order, his would be called reverse lexicographical and is often easier to handle since it doesn't depend on n. But since most people think about normal (forward) lexicographical order first, this answer might be more intuitive to many and might even be the required way for some applications.
Here is a bit of Python code which lists all sequence elements in order, and does the conversion from index i to endpoints [k:m] the way I described above:
from math import ceil, sqrt

n = 3
print("{:3} []".format(0))
for i in range(1, n*(n+1)//2 + 1):
    b = 1 - 2*n
    c = 2*(i - n) - 1
    # solve k^2 + b*k + c = 0
    k = int(ceil((-b - sqrt(b*b - 4*c))/2.))
    m = k + i - k*(2*n - k + 1)//2
    print("{:3} [{}:{}]".format(i, k, m))
The - 1 term in c doesn't come from the mathematical formula I presented above. It's more like subtracting 0.5 from each value of i. This ensures that even if the result of sqrt is slightly too large, you won't end up with a k which is too large. So that term accounts for numeric imprecision and should make the whole thing pretty robust.
The term k*(2*n-k+1)//2 is the last index belonging to starting point k-1, so i minus that term is the length of the subsequence under consideration.
You can simplify things further. You can perform some computation outside the loop, which might be important if you have to choose random sequences repeatedly. You can divide b by a factor of 2 and then get rid of that factor in a number of other places. The result could look like this:
from math import ceil, sqrt

n = 3
b = n - 0.5
bbc = b*b + 2*n + 1
print("{:3} []".format(0))
for i in range(1, n*(n+1)//2 + 1):
    k = int(ceil(b - sqrt(bbc - 2*i)))
    m = k + i - k*(2*n - k + 1)//2
    print("{:3} [{}:{}]".format(i, k, m))
It is a little strange to give the empty list equal weight with the others. It is more natural for the empty list to be given weight 0 or n+1 times the others, if there are n elements on the list. But if you want it to have equal weight, you can do that.
There are n*(n+1)/2 nonempty contiguous sublists. You can specify these by the end point, from 0 to n-1, and the starting point, from 0 to the endpoint.
Generate a random integer x from 0 to n*(n+1)/2.
If x=0, return the empty list. Otherwise, x is uniformly distributed from 1 through n(n+1)/2.
Compute e = floor(sqrt(2*x)-1/2). This takes the values 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, etc.
Compute s = (x-1) - e*(e+1)/2. This takes the values 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, ...
Return the interval starting at index s and ending at index e.
(s,e) takes the values (0,0),(0,1),(1,1),(0,2),(1,2),(2,2),...
import random
import math

n = 10
x = random.randint(0, n*(n+1)//2)
if x == 0:
    print(list(range(n))[0:0])  # empty list
else:
    e = int(math.floor(math.sqrt(2*x) - 0.5))
    s = int(x - 1 - e*(e+1)//2)
    print(list(range(n))[s:e+1])  # starting at s, ending at e, inclusive
First create all possible slice indexes.
[0:0], [1:1], etc. are equivalent, so we include only one of those.
Finally, you pick a random index couple and apply it.
import random

l = [0, 1, 2]
combination_couples = [(0, 0)]
length = len(l)

# Create all index couples.
for j in range(1, length + 1):
    for i in range(j):
        combination_couples.append((i, j))
print(combination_couples)

rand_tuple = random.sample(combination_couples, 1)[0]
final_slice = l[rand_tuple[0]:rand_tuple[1]]
print(final_slice)
To ensure we got them all:
for i in combination_couples:
    print(l[i[0]:i[1]])
Alternatively, with some math...
For a length-3 list there are 0 to 3 possible index numbers, that is n=4. You have 2 of them, that is k=2. The first index has to be smaller than the second, therefore we need to calculate the combinations as described here.
from math import factorial as f

def total_combinations(n, k=2):
    result = 1
    for i in range(1, k + 1):
        result *= n - k + i
    result //= f(k)  # integer division keeps the count an int
    # We add 1 since we included [0:0] as well.
    return result + 1

print(total_combinations(n=4))  # Prints 7 as expected.
there must be a way to generate a single random number and use that one value to figure out both starting index and end/length.
It is difficult to say which method is best, but if you're only interested in binding a single random number to your contiguous slice, you can use modulo.
Given a list l and a single random number r, you can get your contiguous slice like this:
l[r % len(l) : some_sparkling_transformation(r) % len(l)]
where some_sparkling_transformation(r) is essential. It depends on your needs, but since I don't see any special requirements in your question, it could be for example:
l[r % len(l) : (2 * r) % len(l)]
The most important thing here is that both the left and right edges of the slice are correlated to r. This makes it a problem to define contiguous slices that won't follow any observable pattern. The example above (with 2 * r) produces slices that are always empty lists or follow a pattern of [a : 2 * a].
Let's use some intuition. We know that we want to find a good random representation of the number r in the form of a contiguous slice. It turns out that we need to find two numbers, a and b, which are respectively the left and right edges of the slice. Assuming that r is a good random number (we like it in some way), we can say that a = r % len(l) is a good approach.
Let's now try to find b. A nice way to generate another random number is to use a random number generator (random or numpy) which supports seeding (both of them do). Example with the random module:
import random

def contiguous_slice(l, r):
    random.seed(r)
    a = int(random.uniform(0, len(l) + 1))
    b = int(random.uniform(0, len(l) + 1))
    a, b = sorted([a, b])
    return l[a:b]
Good luck and have fun!

Finding numbers from a to b not divisible by x to y

This is a problem I've been pondering for quite some time.
What is the fastest way to find all numbers from a to b that are not divisible by any number from x to y?
Consider this:
I want to find all the numbers from 1 to 10 that are not divisible by 2 to 5.
This process will become extremely slow if I were to use a linear approach, like this:
result = []
a = 1
b = 10
x = 2
y = 5
for i in range(a, b):
    t = False
    for j in range(x, y):
        if i % j == 0:
            t = True
            break
    if t is False:
        result.append(i)
print(result)  # (`return` only works inside a function)
Does anybody know of any other methods of doing this with less computation time than a linear solution?
If not, can anyone see how this might be done faster, as I am blank at this point...
Sincerely,
John
[EDIT]
The numbers range from 0 to >1e+100. This is true for a, b, x and y.
You only need to check prime values in the range of the possible divisors - for example, if a value is not divisible by 2, it won't be divisible by any multiple of 2 either; likewise for every other prime and prime multiple. Thus in your example you can check 2, 3, 5 - you don't need to check 4, because anything divisible by 4 must be divisible by 2. Hence, a faster approach would be to compute primes in whatever range you are interested in, and then simply calculate which values they divide.
Another speedup is to add each value in the range you are interested in to a set: when you find that it is divisible by a number in your range, remove it from the set. You then should only be testing numbers that remain in the set - this will stop you testing numbers multiple times.
If we combine these two approaches, we see that we can create a set of all values (so in the example, a set with all values 1 to 10), and simply remove the multiples of each prime in your second range from that set.
Edit: As Patashu pointed out, this won't quite work if the prime that divides a given value is not in the set. To fix this, we can apply a similar algorithm to the above: create a set with the values [x, y], and for each value in the set, remove all of its multiples. So for the example given below in the comments (with [3, 6]) we'd start with 3 and remove its multiples from the set - so 6. Hence the remaining values we need to test would be [3, 4, 5], which is what we want in this case.
Edit2: Here's a really hacked up, crappy implementation that hasn't been optimized and has horrible variable names:
def find_non_factors():
    a = 1
    b = 1000000
    x = 200
    y = 1000
    # Sieve the divisor range: z[k - x] stays True if k has no smaller factor in [x, y]
    z = [True for p in range(x, y + 1)]
    for k, i in enumerate(z):
        if i:
            k += x
            n = 2
            while n * k < y + 1:
                z[(n * k) - x] = False
                n += 1
    # Start with every value in [a, b] and remove multiples of the surviving divisors
    k = {p for p in range(a, b + 1)}
    for p, v in enumerate(z):
        if v:
            t = p + x
            n = 1
            while n * t < (b + 1):
                if (n * t) in k:
                    k.remove(n * t)
                n += 1
    return k
Try your original implementation with those numbers. It takes > 1 minute on my computer. This implementation takes under 2 seconds.
Ultimate optimization caveat: do not prematurely optimize. Any time you attempt to optimize code, profile it to ensure it needs optimization, and profile the optimization on the same kind of data you intend it to be optimized for, to confirm it is a speedup. Almost all code does not need optimization; it just needs to give the correct answer.
If you are optimizing for small x-y and large a-b:
Create an array with length that is the lowest common multiple out of all the x, x+1, x+2... y. For example, for 2, 3, 4, 5 it would be 60, not 120.
Now populate this array with booleans - false initially for every cell, then for each number in x-y, populate all entries in the array that are multiples of that number with true.
Now for each number in a-b, index into the array modulo the array length: if the entry is true, skip the number; if it is false, keep it as a result (see the sketch after this list).
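Here is a minimal sketch of that idea (my own illustration, not the answerer's code): divisibility by anything in [x, y] is periodic with period lcm(x..y), so one boolean per residue class suffices.

from math import gcd
from functools import reduce

def divisible_lookup(x, y):
    period = reduce(lambda m, n: m * n // gcd(m, n), range(x, y + 1))  # lcm of x..y
    table = [False] * period
    for j in range(x, y + 1):
        for multiple in range(0, period, j):
            table[multiple] = True
    return table

table = divisible_lookup(2, 5)  # len(table) == 60, the lcm of 2, 3, 4, 5
print([i for i in range(1, 11) if not table[i % len(table)]])  # [1, 7]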
You can do this a little quicker by removing from your x to y factors the numbers whose prime factor expansions are strict supersets of other numbers' prime factor expansions. By which I mean - if you have 2, 3, 4, 5, then 4 is 2*2, a strict superset of 2, so you can remove it, and now our array length is only 30. For something like 3, 4, 5, 6 however, 4 is 2*2 and 6 is 3*2 - 6 is a superset of 3 so we remove it, but 4 is not a superset of anything else in the list, so we keep it in. The LCM is 3*2*2*5 = 60. Doing this kind of thing would give some speedup on its own for large a-b, and you might not need to go the array direction if that's all you need.
Also, keep in mind that if you aren't going to use the entire result of the function every single time - like, maybe sometimes you're only interested in the lowest value - write it as a generator rather than as a function. That way you can call it until you have enough numbers and then stop, saving time.
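For instance, a minimal generator sketch (my own illustration, using the naive divisibility test; the faster approaches above could be swapped in):

from itertools import islice

def non_divisible(a, b, x, y):
    # Yield numbers in [a, b] that no number in [x, y] divides.
    for i in range(a, b + 1):
        if all(i % j != 0 for j in range(x, y + 1)):
            yield i

# Take only as many values as you need, then stop early.
print(list(islice(non_divisible(1, 100, 2, 5), 5)))  # [1, 7, 11, 13, 17]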
