Calculate average value of all sorted array combinations


I need to get the statistical expected value of a n choose k drawing in a sorted array.
As an example, let's consider I want to choose 2 elements from the following sorted array
[1, 2, 3]
The set of all possible combinations is the following:
(1, 2)
(1, 3)
(2, 3)
So the expected value of the first element is (1 + 1 + 2) / 3 = 1.33, and the expected value of the second element is (2 + 3 + 3) / 3 = 2.67
Here is a function that works with a bruteforce approach for doing that, but it is too slow to be used on large arrays.
Is there a smarter/faster way?
import itertools
import math
def combinations_expected_value(arr, k):
    sums = [0] * k
    l = math.comb(len(arr), k)
    for comb in itertools.combinations(arr, k):
        for i in range(k):
            sums[i] += comb[i]
    return [sums[i] / l for i in range(k)]
Thank you!

For each position p (0-based) in the combination, the possible values are a subset of the list starting at index p and stopping before the last k-p-1 elements, because that many larger elements must still follow. E.g. for combinations of 6 in 1..100, the third position can only contain values 3..97.
For each of the position/value pairs, the number of occurrences will be the product of combinations of left side elements and combinations of right side elements.
For example, for combinations of 6 elements within a list of 1..100, the number of times 45 will appear at the third position is the combinations of 2 in 1..44 times the combinations of 3 in 46..100. So we will have C(44,2) * C(55,3) * 45 for that position/value pair.
You can repeat this calculation for each position/value pair to obtain a total for each position in the output combinations. Then divide these totals by the number of combinations to get the expected value:
from math import comb
def countComb(N,k):
    result = [0]*k
    for p in range(k):   # p is count on the left
        q = k-p-1        # q is count on the right
        for i in range(p,len(N)-q):
            left = comb(i,p)             # combinations on the left >= 1
            right = comb(len(N)-i-1,q)   # combinations on the right >= 1
            result[p] += left * right * N[i]
    return result
def combProb(N,k):
    Cnk = comb(len(N),k)
    return [S/Cnk for S in countComb(N,k)]
Output:
print(countComb([1,2,3],2)) # [4, 8]
print(combProb([1,2,3],2)) # [1.3333333333333333, 2.6666666666666665]
print(countComb([1,2,3,4,5],3)) # [15, 30, 45]
print(combProb([1,2,3,4,5],3)) # [1.5, 3.0, 4.5]
# test with large number of combinations:
print(countComb(list(range(1,301)),7))
[1521500803497675, 3043001606995350, 4564502410493025,
6086003213990700, 7607504017488375, 9129004820986050,
10650505624483725]
print(combProb(list(range(1,301)),7))
[37.625, 75.25, 112.875, 150.5, 188.125, 225.75, 263.375]
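As a sanity check, the counting argument can be compared against plain enumeration on a small input. The sketch below is only illustrative: `position_sums` restates the counting loop and `position_sums_bruteforce` is a hypothetical helper name, neither taken from the answer.

```python
from itertools import combinations
from math import comb

def position_sums(values, k):
    """Per-position totals via counting: C(i, p) choices on the left, C(n-i-1, q) on the right."""
    n = len(values)
    sums = [0] * k
    for p in range(k):
        q = k - p - 1  # number of elements that must follow position p
        for i in range(p, n - q):
            sums[p] += comb(i, p) * comb(n - i - 1, q) * values[i]
    return sums

def position_sums_bruteforce(values, k):
    """Same totals by enumerating every combination (only feasible for small inputs)."""
    sums = [0] * k
    for combo in combinations(values, k):
        for p, v in enumerate(combo):
            sums[p] += v
    return sums

values = [2, 3, 5, 7, 11, 13]  # any sorted list works, not just 1..n
for k in range(1, len(values) + 1):
    assert position_sums(values, k) == position_sums_bruteforce(values, k)
```

The counting version does on the order of n * k binomial evaluations, whereas the enumeration grows with C(n, k).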

Related

Combinatorics 1 to 1 mapping for Power Groups

Problem Description:
I'm working on making a function which gives me a definition for a particular combination of several descriptors based on a single index. My inputs are a set of raw features X = [feat0,feat1,feat2,feat3,feat4], a list of powers to be used pow = [1,2,3], and a list of group sizes sizes = [1,3,5]. A valid output might look like the following:
feat0^2 * feat4^3 * feat1^1
This output is valid because feat0, feat4, and feat1 exist within X, their powers exist within pow, and the number of features being combined is in sizes.
Invalid edge cases include:
values which don't exist in X, powers not in pow, and combination sizes not in sizes
combinations that are identical to another are invalid: feat0^2 * feat1^3 and feat1^3 * feat0^2 are the same
combinations that include multiples of the same feature are invalid: feat0^1 * feat0^3 * feat2^2 is invalid
under the hood I'm encoding these groupings as lists of tuples. So feat0^2 * feat4^3 * feat1^1 would be represented as [(0,2), (4,3), (1,1)], where the first element in the tuple is the feature index, and the second is the power.
Question:
my question is, how can I create a 1 to 1 mapping of a particular combination to an index i? I would like to get the number of possible combinations, and be able to plug in an integer i to a function, and have that function generate a particular combination. Something like this:
import random
X = [0.123, 0.111, 11, -5]
pow = [1,2,3]
sizes = [1,3]
#getting total number of combinations
numCombos = get_num_combos(X,pow,sizes)
#getting a random index corresponding to a grouping
#(randint includes both endpoints, so the last valid index is numCombos - 1)
i = random.randint(0, numCombos - 1)
#getting grouping
grouping = generate_grouping(i, X, pow, sizes)
print(grouping)
Resulting in something like
[(0,1), (1,2), (3,1)]
So far, figuring out the generation when not accounting for the various edge cases wasn't too hard, but I'm at a loss for how to account for edge cases 2 and 3; making it guaranteed that no value of i is algebraically equivalent to any other value of i, and that the same feature does not appear multiple times in a grouping.
Current Progress
import math
import bisect
import numpy as np
#computes the n choose k of a list and a size
def get_num_groupings(n, k):
    return math.comb(n, k)
i = 150
n = 5
m = 3
sizes = [1, 3, 5]
#computing the number of elements in each group length
numElements = [m**k * get_num_groupings(n, k) for k in sizes]
#index bins for each group size
bins = list(np.cumsum(numElements))[:-1]
#getting the current group size (bisect_right, so that i == 15 lands in the second bin)
binIdx = bisect.bisect_right(bins,i)
curSize = sizes[binIdx]
#adding idx 0 to bins
bins = [0]+bins
#getting the location of i in the bin
z = i - bins[binIdx]
#getting the product index and combination rank
#(each power product is paired with every one of the n choose curSize feature combinations)
numCombos = get_num_groupings(n, curSize)
pi = z // numCombos
ci = z % numCombos
#getting the indexes of the powers
pidx = [(pi // m**(curSize - (num+1)))%m for num in range(curSize)]
#getting the indexes of the features
#TODO cidx = unrank(i, range(n))
This is based on the Mad Physicist's answer, though I haven't figured out how to get cidx yet. Some of the variable names are rewritten for my own understanding. To my knowledge this implementation works by logically separating the combinations of variables from the powers each of them has. So far, I can get the powers from an index i, and once unrank is ironed out I should be able to get the indexes for which features are used.

Let's look at a slightly different problem that's closely related to what you want: generate all the possible valid combinations.
If you choose a size and a power, finding all possible combinations of features is fairly straightforward:
from itertools import combinations, product
n = len(X)
m = len(powers)
k = size = ... # e.g. 3
pow = ... # e.g. [1, 2, 3]
The iterator of unique combinations of features is given by
def elements(X, size, pow):
    for x in combinations(X, size):
        yield sum(e**p for p, e in zip(pow, x))
The equivalent one-liner would be
(sum(e**p for p, e in zip(pow, x)) for x in combinations(X, size))
This generator has exactly n choose k unique elements. These elements meet all your conditions by definition.
Now you can loop over all possible sizes and product of powers to get all the options:
def all_features(X, sizes, powers):
    for size in sizes:
        for pow in product(powers, repeat=size):
            for x in combinations(X, size):
                yield sum(e**p for p, e in zip(pow, x))
The total number of elements is the sum for each k of m**k * n choose k.
Now that you've counted the possibilities, you can compute the mapping of element to index and vice versa, using a combinatorial number system. Sample ranking and unranking functions for combinations are shown here. You can use them after you adjust the index for the size and power bins.
To show what I mean, assume you have three functions (given in the linked answer):
choose(n, k) computes n choose k
rank(combo) accepts the ordered indices of a specific combination and returns the rank.
unrank(ind, k) accepts a rank and size, and returns the k indices of the corresponding combination.
You can then compute the offsets of each size group and the step for each power within that group. Let's work through your concrete example with n = 5, m = 3, and sizes = [1, 3, 5].
The number of elements for each size is given by
elements = [m**k * choose(n, k) for k in sizes]
The total number of possible arrangements is sum(elements):
3**1 * choose(5, 1) + 3**3 * choose(5, 3) + 3**5 * choose(5, 5) = 3 * 5 + 27 * 10 + 243 * 1 = 15 + 270 + 243 = 528
The cumulative sum is useful to convert between index and element:
cumsum = [0, 15, 285]
When you get an index, you can check which bin it falls in using bisect.
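Let's say you were given index = 55. Since 15 <= 55 < 285, your offset is 15, size = 3. Within the size = 3 group, you have an offset of z = 55 - 15 = 40.
Within the k = 3 group, there are m**k = 3**3 = 27 power products, and each one is paired with all choose(n, k) = 10 feature combinations (the combinations are the innermost loop of all_features). The index of the product is therefore pi = z // choose(n, k) = 4 and the combination rank is ci = z % choose(n, k) = 0.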
So the indices of the power are given by
pidx = [(pi // m**(k - 1)) % m, (pi // m**(k - 2)) % m, ...]
Similarly, the indices of the combination are given by
cidx = unrank(ci, k)
You can convert all these indices into a value using something like
sum(X[q]**powers[p] for p, q in zip(pidx, cidx))
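The whole index-to-grouping mapping can be sketched end to end as follows. This is one possible arrangement, not the linked answer's code: `unrank_combination` is a hypothetical lexicographic unranking helper, the signature of `generate_grouping` is an assumption, and the ordering assumed is the one produced by all_features (sizes outermost, then power products, then feature combinations).

```python
from bisect import bisect_right
from itertools import accumulate
from math import comb

def unrank_combination(rank, n, k):
    """Return the rank-th k-subset of range(n) in lexicographic order (hypothetical helper)."""
    combo = []
    start = 0
    for remaining in range(k, 0, -1):
        for candidate in range(start, n):
            # number of combinations that use `candidate` as their next element
            block = comb(n - candidate - 1, remaining - 1)
            if rank < block:
                combo.append(candidate)
                start = candidate + 1
                break
            rank -= block
    return combo

def generate_grouping(i, n, powers, sizes):
    """Map index i to a list of (feature index, power) tuples; assumed signature."""
    m = len(powers)
    counts = [m**k * comb(n, k) for k in sizes]
    offsets = [0] + list(accumulate(counts))  # e.g. [0, 15, 285, 528]
    if not 0 <= i < offsets[-1]:
        raise IndexError("index out of range")
    b = bisect_right(offsets, i) - 1          # which size bin i falls into
    k = sizes[b]
    z = i - offsets[b]                        # position inside the bin
    pi, ci = divmod(z, comb(n, k))            # power-product index, combination rank
    pidx = [(pi // m**(k - 1 - j)) % m for j in range(k)]  # base-m digits of pi
    cidx = unrank_combination(ci, n, k)
    return [(f, powers[p]) for f, p in zip(cidx, pidx)]

print(generate_grouping(55, 5, [1, 2, 3], [1, 3, 5]))
```

Because the feature indices come out in increasing order and never repeat, two different indices cannot produce groupings that differ only in the order of their factors.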

Count all pairs with given XOR

Given a list of size N. Find the number of pairs (i, j) such that A[i] XOR A[j] = x, and 1 <= i < j <= N.
Input : list = [3, 6, 8, 10, 15, 50], x = 5
Output : 2
Explanation : (3 ^ 6) = 5 and (10 ^ 15) = 5
This is my code (brute force):
import itertools
n=int(input())
pairs=0
l=list(map(int,raw_input().split()))
q=[x for x in l if x%2==0]
p=[y for y in l if y%2!=0]
for a, b in itertools.combinations(q, 2):
    if (a^b!=2) and ((a^b)%2==0) and (a!=b):
        pairs+=1
for a, b in itertools.combinations(p, 2):
    if (a^b!=2) and ((a^b)%2==0) and (a!=b):
        pairs+=1
print pairs
how to do this more efficiently in a complexity of O(n) in python?

Observe that if A[i]^A[j] == x, this implies that A[i]^x == A[j] and A[j]^x == A[i].
So, an O(n) solution would be to iterate through an associative map (dict) where each key is an item from A and each value is the respective count of the item. Then, for each item, calculate A[i]^x, and see if A[i]^x is in the map. If it is in the map, this implies that A[i]^A[j] == x for some j. Since we have a map with the count of all items that equal A[j], the total number of pairs will be num_Ai * num_Aj. Note that each pair will be counted twice since XOR is commutative (i.e. A[i]^A[j] == A[j]^A[i]), so we have to divide the final count by 2.
def create_count_map(lst):
    result = {}
    for item in lst:
        if item in result:
            result[item] += 1
        else:
            result[item] = 1
    return result
def get_count(lst, x):
    count_map = create_count_map(lst)
    total_pairs = 0
    for item in count_map:
        xor_res = item ^ x
        if xor_res in count_map:
            total_pairs += count_map[xor_res] * count_map[item]
    return total_pairs // 2
print(get_count([3, 6, 8, 10, 15, 50], 5))
print(get_count([1, 3, 1, 3, 1], 2))
outputs
2
6
as desired.
Why is this O(n)?
Converting a list to a dict s.t. the dict contains the count of each item in the list is O(n) time.
Calculating item ^ x is O(1) time, and calculating whether this result is in a dict is also O(1) time. dict key access is also O(1), and so is multiplication of two elements. We do all this n times, hence O(n) time for the loop.
O(n) + O(n) reduces to O(n) time.
Edited to handle duplicates correctly.

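The accepted answer does not give the correct result for x = 0: every item is then matched with itself, so it sums c * c over the distinct values and halves the result, where c * (c - 1) // 2 per value is needed (c being the number of occurrences). This code handles the x = 0 case by counting pairs of equal elements; it uses the values as list indices, so it assumes non-negative integers. You can modify it to get answers for other values as well.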
def calculate(a) :
    # Finding the maximum of the array
    maximum = max(a)
    # Creating frequency array
    # With initial value 0
    frequency = [0 for x in range(maximum + 1)]
    # Traversing through the array
    for i in a :
        # Counting frequency
        frequency[i] += 1
    answer = 0
    # Traversing through the frequency array
    for i in frequency :
        # Calculating answer
        answer = answer + i * (i - 1) // 2
    return answer
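The two answers can be folded into one function that treats x = 0 separately. This is a minimal sketch rather than either author's code; the name `count_xor_pairs` is made up for the example, and it relies on `collections.Counter` for the frequency map.

```python
from collections import Counter

def count_xor_pairs(values, x):
    """Count pairs i < j with values[i] ^ values[j] == x (hypothetical helper)."""
    counts = Counter(values)
    if x == 0:
        # a ^ b == 0 only when a == b, so choose 2 out of each group of equal values
        return sum(c * (c - 1) // 2 for c in counts.values())
    # for x != 0 the partner v ^ x always differs from v, so every pair is seen twice
    total = sum(c * counts.get(v ^ x, 0) for v, c in counts.items())
    return total // 2

print(count_xor_pairs([3, 6, 8, 10, 15, 50], 5))  # 2, as in the question
print(count_xor_pairs([1, 3, 1, 3, 1], 2))        # 6
print(count_xor_pairs([1, 1, 1, 2], 0))           # 3
```

Using a dict-based counter instead of a frequency list also means negative or very large values need no special handling.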

Increment first n list elements given a condition

I have a list for example
l = [10, 20, 30, 40, 50, 60]
I need to increment the first n elements of the list given a condition. The condition is independent of the list. For example if n = 3, the list l should become :
l = [11, 21, 31, 40, 50, 60]
I understand that I can do it with a for loop on each element of the list. But I need to do such operation around 150 million times. So, I am looking for a faster method to do this. Any help is highly appreciated. Thanks in advance

Here's an operation-aggregating implementation in NumPy:
import numpy
initial_array = ... # whatever your l is, but as a NumPy array
increments = numpy.zeros_like(initial_array)
...
# every time you want to increment the first n elements
if n:
    increments[n-1] += 1
...
# to apply the increments
initial_array += increments[::-1].cumsum()[::-1]
This is O(ops + len(initial_array)), where ops is the number of increment operations. Unless you're only doing a small number of increments over a very small portion of the list, this should be much faster. Unlike the naive implementation, it doesn't let you retrieve element values until the increments are applied; if you need to do that, you might need a solution based on a BST or BST-like structure to track increments.
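If element values do have to be read between increments, one structure that could fill the "BST-like" role is a Fenwick tree (binary indexed tree), which is swapped in here as an illustration and is not part of the answer above. The class and method names are hypothetical. Each prefix increment and each single-element read costs O(log n).

```python
class PrefixIncrementer:
    """Supports 'add to the first n elements' and 'read element i' in O(log n) each."""

    def __init__(self, values):
        self.base = list(values)
        self.tree = [0] * (len(self.base) + 1)  # 1-based Fenwick tree of increment end points
        self.total = 0                          # sum of all increments applied so far

    def increment_first(self, n, amount=1):
        """Add `amount` to the first n elements."""
        if n <= 0:
            return
        n = min(n, len(self.base))
        self.total += amount
        pos = n  # 1-based position of the last incremented element
        while pos < len(self.tree):
            self.tree[pos] += amount
            pos += pos & -pos

    def _ended_at_or_before(self, pos):
        """Sum of increments whose last 1-based position is <= pos."""
        s = 0
        while pos > 0:
            s += self.tree[pos]
            pos -= pos & -pos
        return s

    def get(self, i):
        """Current value of element i (0-based)."""
        # increments that reach index i are exactly those ending after 1-based position i
        return self.base[i] + self.total - self._ended_at_or_before(i)

inc = PrefixIncrementer([10, 20, 30, 40, 50, 60])
inc.increment_first(3)
print([inc.get(i) for i in range(6)])  # [11, 21, 31, 40, 50, 60]
```

When reads are rare, the aggregate-then-apply approach above is simpler and has lower overhead per operation.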

With m the number of queries and n the length of the list to increment, here is the idea for an O(n + m) algorithm:
Since you only ever increment from the start up to some k-th element, you get ranges of increments. Represent each increment as a pair (up to position, increment by). Example:
(1, 2) - increment positions 0 and 1 by 2
To calculate the final value at position k, add to the current value at position k every increment whose position is greater than or equal to k. How can we quickly calculate the sum of those increments? Walk the list from the back and keep a running sum of the increments seen so far.
Proof of concept:
# list to increment
a = [1, 2, 5, 1, 6]
# (up to and including k-th index, increment by value)
queries = [(1, 2), (0, 10), (3, 11), (4, 3)]
# described algorithm implementation
increments = [0]*len(a)
for position, inc in queries:
    increments[position] += inc
got = list(a)
increments_sum = 0
for i in xrange(len(increments) -1, -1, -1):
    increments_sum += increments[i]
    got[i] += increments_sum
# verify that solution is correct using slow but correct algorithm
expected = list(a)
for position, inc in queries:
    for i in xrange(position + 1):
        expected[i] += inc
print 'Expected: ', expected
print 'Got: ', got
output:
Expected: [27, 18, 19, 15, 9]
Got: [27, 18, 19, 15, 9]

You can create a simple data structure on top of your list which stores the start and end range of each increment operation. The start would be 0 in your case so you can just store the end.
This way you don't have to actually traverse the list to increment the elements, but you only retain that you performed increments on ranges for example {0 to 2} and {0 to 3}. Furthermore, you can also collate some operations, so that if multiple operations increment until the same index, you only need to store one entry.
The worst case complexity of this solution is O(q + g x qlogq + n) where g is the number of get operations, q is the number of updates and n is the length of the list. Since we can have at most n distinct endings for the intervals this reduces to O(q + nlogn + n) = O(q + nlogn). A naive solution using an update for each query would be O(q * l) where l (the length of a query) could be up to the size of n giving O(q * n). So we can expect this solution to be better when q > log n.
Working python example below:
import collections
class RangeStructure(object):
    def __init__(self, l):
        self.ranges = collections.defaultdict(int)
        self.l = l
    def incToPosition(self, k):
        self.ranges[k] += 1
    def get(self):
        res = list(self.l)  # copy, so calling get() again does not re-apply the increments
        sorted_keys = sorted(self.ranges)
        last = len(sorted_keys) - 1
        to_add = 0
        while last >= 0:
            start = 0 if last < 1 else sorted_keys[last - 1]
            end = sorted_keys[last]
            to_add += self.ranges[end]
            for i in range(start, end):
                res[i] += to_add
            last -= 1
        return res
rs = RangeStructure([10, 20, 30, 40, 50, 60])
rs.incToPosition(2)
rs.incToPosition(2)
rs.incToPosition(3)
rs.incToPosition(4)
print rs.get()
And an explanation:
after the inc operations ranges will contain (start, end, inc) tuples of the form (0, 2, 2), (0, 3, 1), (0, 4, 1); these will be represented in the dict as { 2:2, 3:1, 4:1} since the start is always 0 and can be omitted
during the get operation, we ensure that we only operate on any list element once; we sort the ranges in increasing order of their end point, and traverse them in reverse order updating the contained list elements and the sum (to_add) to be added to subsequent ranges
This prints, as expected:
[14, 24, 32, 41, 50, 60]

You can use list comprehension and add the remaining list
[x + 1 for x in a[:n]]+a[n:]

Speeding up algorithm that finds multiples in a given range

I'm stumped on how to speed up my algorithm, which sums multiples in a given range. This is for a problem on codewars.com; here is a link to the problem:
codewars link
Here's the code; I'll explain what's going on at the bottom.
import itertools
def solution(number):
    return multiples(3, number) + multiples(5, number) - multiples(15, number)
def multiples(m, count):
    l = 0
    for i in itertools.count(m, m):
        if i < count:
            l += i
        else:
            break
    return l
print solution(50000000) #takes 41.8 seconds
#one of the testers takes 50000000000000000000000000000000000000000 as input
# def multiples(m, count):
#     l = 0
#     for i in xrange(m,count ,m):
#         l += i
#     return l
So basically the problem asks the user to return the sum of all the multiples of 3 or 5 below a given number. Here are the tests:
test.assert_equals(solution(10), 23)
test.assert_equals(solution(20), 78)
test.assert_equals(solution(100), 2318)
test.assert_equals(solution(200), 9168)
test.assert_equals(solution(1000), 233168)
test.assert_equals(solution(10000), 23331668)
My program has no problem getting the right answer. The problem arises when the input is large. When I pass in a number like 50000000 it takes over 40 seconds to return the answer. One of the inputs I'm asked to take is 50000000000000000000000000000000000000000, which is a huge number. That's also the reason why I'm using itertools.count(): I tried using xrange in my first attempt, but xrange can't handle numbers larger than a C long. I know the slowest part of the program is the multiples function, yet it is still faster than my first attempt, which used a list comprehension and checked whether i % 3 == 0 or i % 5 == 0. Any ideas?

This solution should be faster for large numbers.
def solution(number):
    number -= 1
    a, b, c = number // 3, number // 5, number // 15
    asum, bsum, csum = a*(a+1) // 2, b*(b+1) // 2, c*(c+1) // 2
    return 3*asum + 5*bsum - 15*csum
Explanation:
Take any sequence from 1 to n:
1, 2, 3, 4, ..., n
Its sum will always be given by the formula n(n+1)/2. This can be seen easily if you consider that the expression (1 + n) / 2 is just a shortcut for computing the average, or arithmetic mean, of this particular sequence of numbers. Because average(S) = sum(S) / length(S), if you take the average of any sequence of numbers and multiply it by the length of the sequence, you get the sum of the sequence.
If we're given a number n, and we want the sum of the multiples of some given k up to n, including n, we want to find the summation:
k + 2k + 3k + 4k + ... + xk
where xk is the highest multiple of k that is less than or equal to n. Now notice that this summation can be factored into:
k(1 + 2 + 3 + 4 + ... + x)
We are given k already, so now all we need to find is x. If x is defined to be the highest number you can multiply k by to get a natural number less than or equal to n, then we can get the number x by using Python's integer division:
n // k == x
Once we find x, we can find the sum of the multiples of any given k up to a given n using previous formulas:
k(x(x+1)/2)
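Our three given k's are 3, 5, and 15. Because the problem only counts multiples strictly below number, the code first does number -= 1 so that n is the largest value allowed.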
We find our x's in this line:
a, b, c = number // 3, number // 5, number // 15
Compute the summations of their multiples up to n in this line:
asum, bsum, csum = a*(a+1) // 2, b*(b+1) // 2, c*(c+1) // 2
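And finally, multiply their summations by k in this line, subtracting the multiples of 15 because they are counted once as multiples of 3 and once as multiples of 5: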
return 3*asum + 5*bsum - 15*csum
And we have our answer!
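The closed form is easy to cross-check against direct summation on small inputs. The sketch below restates the formula with a hypothetical helper `sum_multiples_below` and compares it with a brute-force loop over the question's test values.

```python
def sum_multiples_below(limit, k):
    """Sum of the positive multiples of k that are strictly less than limit."""
    x = (limit - 1) // k          # how many such multiples exist
    return k * x * (x + 1) // 2   # k * (1 + 2 + ... + x)

def solution(number):
    # inclusion-exclusion: multiples of 15 are counted under both 3 and 5
    return (sum_multiples_below(number, 3)
            + sum_multiples_below(number, 5)
            - sum_multiples_below(number, 15))

def brute_force(number):
    return sum(i for i in range(number) if i % 3 == 0 or i % 5 == 0)

for n in (10, 20, 100, 200, 1000, 10000):
    assert solution(n) == brute_force(n)

# The huge input from the question needs only a handful of integer operations.
print(solution(50000000000000000000000000000000000000000))
```

Python integers are arbitrary precision, so the large input involves no overflow concerns.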

MaxDoubleSliceSum Algorithm

I'm trying to solve the problem of finding the MaxDoubleSliceSum value. Simply, it's the maximum sum of any slice minus one element within this slice (you have to drop one element, and the first and the last element are excluded also). So, technically the first and the last element of the array cannot be included in any slice sum.
Here's the full description:
A non-empty zero-indexed array A consisting of N integers is given.
A triplet (X, Y, Z), such that 0 ≤ X < Y < Z < N, is called a double slice.
The sum of double slice (X, Y, Z) is the total of A[X + 1] + A[X + 2] + ... + A[Y − 1] + A[Y + 1] + A[Y + 2] + ... + A[Z − 1].
For example, array A such that:
A[0] = 3
A[1] = 2
A[2] = 6
A[3] = -1
A[4] = 4
A[5] = 5
A[6] = -1
A[7] = 2
contains the following example double slices:
double slice (0, 3, 6), sum is 2 + 6 + 4 + 5 = 17,
double slice (0, 3, 7), sum is 2 + 6 + 4 + 5 − 1 = 16,
double slice (3, 4, 5), sum is 0.
The goal is to find the maximal sum of any double slice.
Write a function:
def solution(A)
that, given a non-empty zero-indexed array A consisting of N integers, returns the maximal sum of any double slice.
For example, given:
A[0] = 3
A[1] = 2
A[2] = 6
A[3] = -1
A[4] = 4
A[5] = 5
A[6] = -1
A[7] = 2
the function should return 17, because no double slice of array A has a sum of greater than 17.
Assume that:
N is an integer within the range [3..100,000];
each element of array A is an integer within the range [−10,000..10,000].
Complexity:
expected worst-case time complexity is O(N);
expected worst-case space complexity is O(N), beyond input storage (not counting the storage required for input arguments).
Elements of input arrays can be modified.
Here's my try:
def solution(A):
    if len(A) <= 3:
        return 0
    max_slice = 0
    minimum = A[1] # assume the first element is the minimum
    max_end = -A[1] # and drop it from the slice
    for i in xrange(1, len(A)-1):
        if A[i] < minimum: # a new minimum found
            max_end += minimum # put back the false minimum
            minimum = A[i] # assign the new minimum to minimum
            max_end -= minimum # drop the new minimum out of the slice
        max_end = max(0, max_end + A[i])
        max_slice = max(max_slice, max_end)
    return max_slice
What makes me think this may be close to the correct solution, with only some corner cases not yet covered, is that 9 out of 14 test cases pass correctly (https://codility.com/demo/results/demoAW7WPN-PCV/)
I know that this can be solved by applying Kadane’s algorithm forward and backward, but I'd really appreciate it if someone could point out what's missing here.

Python solution O(N)
This should be solved using Kadane’s algorithm from two directions.
ref:
Python Codility Solution
C++ solution - YouTube tutorial
JAVA solution
def compute_sum(start, end, step, A):
    res_arr = [0]
    res = 0
    for i in range(start, end, step):
        res = res + A[i]
        if res < 0:
            res_arr.append(0)
            res = 0
            continue
        res_arr.append(res)
    return res_arr
def solution(A):
    if len(A) < 3:
        return 0
    arr = []
    left_arr = compute_sum(1, len(A)-1, 1, A)
    right_arr = compute_sum(len(A)-2, 0, -1, A)
    k = 0
    for i in range(len(left_arr)-2, -1, -1):
        arr.append(left_arr[i] + right_arr[k])
        k = k + 1
    return max(arr)
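The two-directional idea may be easier to follow with the arrays indexed by position in A. The following is a sketch of the same technique under that layout, not a rewrite of the answer's code; `max_double_slice_sum`, `ending_at` and `starting_at` are names chosen for the example.

```python
def max_double_slice_sum(A):
    """Best double-slice sum: Kadane-style sums from the left and from the right, joined at Y."""
    n = len(A)
    ending_at = [0] * n    # best sum of a (possibly empty) slice ending exactly at index i
    starting_at = [0] * n  # best sum of a (possibly empty) slice starting exactly at index i
    for i in range(1, n - 1):
        ending_at[i] = max(0, ending_at[i - 1] + A[i])
    for i in range(n - 2, 0, -1):
        starting_at[i] = max(0, starting_at[i + 1] + A[i])
    # Y is the dropped element; the left part ends at Y - 1 and the right part starts at Y + 1
    return max((ending_at[y - 1] + starting_at[y + 1] for y in range(1, n - 1)), default=0)

print(max_double_slice_sum([3, 2, 6, -1, 4, 5, -1, 2]))  # 17, the example from the task
```

Indices 0 and n - 1 stay at zero in both arrays, which encodes the rule that the first and last elements can never be part of the sum.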

This is just how I'd write the algorithm.
Assume a start index of X=0, then iteratively sum the elements to the right.
Keep track of the index of the lowest int as you go, and subtract the lowest int from the sum when you use it. This effectively lets you place your Y.
Keep track of the max sum, and the X, Y, Z values for that sum.
If the sum ever turns negative, save the max sum as your result, so long as it is greater than the previous result.
Choose a new X: start looking after Y and subtract one from whatever index you find. Then repeat the previous steps until you have reached the end of the list.
How might this be an improvement?
Potential problem case for your code: [7, 2, 4, -18, -14, 20, 22]
-18 and -14 separate the array into two segments. The sum of the first segment is 7+2+4=13, the sum of the second segment is just 20. The above algorithm handles this case, yours might but I'm bad at python (sorry).
EDIT (error and solution): It appears my original answer brings nothing new to what I thought was the problem, but I checked the errors and found the actual error occurs here: [-20, -10, 10, -70, 20, 30, -30] will not be handled correctly. It will exclude the positive 10, so it returns 50 instead of 60.
It appears the asker's code doesn't correctly identify the new starting position (my method for this is shown in case 4); it's important that you restart the iterations at Y instead of Z because Y effectively deletes the lowest number, which is possibly the Z that fails the test.
