Maximum sum of subsequence of length L with a restriction - python
Given an array of positive integers, how can I find a subsequence of length L with maximum sum such that the distance between any two of its neighboring elements does not exceed K?
I have the following solution but don't know how to take into account length L.
1 <= N <= 100000, 1 <= L <= 200, 1 <= K <= N
f[i] contains the max sum of a subsequence that ends at i.
for i in range(K, N):
    f[i] = INT_MIN
    for j in range(1, K+1):
        f[i] = max(f[i], f[i-j] + a[i])
return max(f)
(edit: slightly simplified non-recursive solution)
You can do it like this, just for each iteration consider if the item should be included or excluded.
def f(maxK, K, N, L, S):
    if L == 0 or not N or K == 0:
        return S
    # either element is included
    included = f(maxK, maxK, N[1:], L-1, S + N[0])
    # or excluded
    excluded = f(maxK, K-1, N[1:], L, S)
    return max(included, excluded)

assert f(2, 2, [10,1,1,1,1,10], 3, 0) == 12
assert f(3, 3, [8, 3, 7, 6, 2, 1, 9, 2, 5, 4], 4, 0) == 30
If N is very long you can consider changing to a table version, you could also change the input to tuples and use memoization.
Since OP later included the information that N can be 100 000, we can't really use recursive solutions like this. So here is a solution that runs in O(nKL), with same memory requirement:
import numpy as np

def f(n, K, L):
    t = np.zeros((len(n), L+1))
    for l in range(1, L+1):
        for i in range(len(n)):
            t[i,l] = n[i] + max((t[i-k,l-1] for k in range(1, K+1) if i-k >= 0), default=0)
    return np.max(t)

assert f([10,1,1,1,1,10], 2, 3) == 12
assert f([8, 3, 7, 6, 2, 1, 9], 3, 4) == 30
Explanation of the non-recursive solution: each cell t[i, l] in the table holds the value of the best subsequence with exactly l elements that uses the element at position i and otherwise only elements at lower positions, where neighboring elements are at most K positions apart.
Subsequences of length 1 (those in t[i, 1]) have only one element, n[i].
Longer subsequences consist of n[i] plus a subsequence of l-1 elements that ends at most K rows earlier; we pick the one with the maximal value. By iterating in this order, we ensure that this value has already been calculated.
Further improvement in memory is possible by considering that you only look at most K steps back.
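As a rough illustration of the memory remark, here is a minimal sketch that keeps only two layers of the table (length l-1 and length l) instead of the full n × L array. The function name max_sum_two_rows and the use of -inf to mark "no valid subsequence" are my own assumptions, not part of the answer above; the running time is still O(nKL).

```python
def max_sum_two_rows(values, K, L):
    """Max sum of exactly L elements whose neighbouring indices differ by at most K.

    Hypothetical helper: returns None when no such subsequence exists.
    Assumes L >= 1 and K >= 1.
    """
    NEG = float("-inf")          # marks "no valid subsequence ends here"
    n = len(values)
    prev = list(values)          # layer for subsequences of length 1
    for _ in range(2, L + 1):
        cur = [NEG] * n
        for i in range(n):
            # Best subsequence one element shorter, ending within K positions before i.
            best = max(prev[max(0, i - K):i], default=NEG)
            if best != NEG:
                cur[i] = values[i] + best
        prev = cur               # the older layer is no longer needed
    result = max(prev, default=NEG)
    return None if result == NEG else result


print(max_sum_two_rows([10, 1, 1, 1, 1, 10], 2, 3))
print(max_sum_two_rows([8, 3, 7, 6, 2, 1, 9, 2, 5, 4], 3, 4))
```

Only two lists of length n are alive at any time, so the memory drops from O(nL) to O(n).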
Here is a bottom up (ie no recursion) dynamic solution in Python. It takes memory O(l * n) and time O(l * n * k).
def max_subseq_sum(k, l, values):
    # table[i][j] will be the highest value from a sequence of length j
    # ending at position i
    table = []
    for i in range(len(values)):
        # We have no sum from 0, and i from len 1.
        table.append([0, values[i]])
        # By length of previous subsequence
        for subseq_len in range(1, l):
            # We look back up to k for the best.
            prev_val = None
            for last_i in range(i-k, i):
                # We don't look back if the sequence was not that long.
                if subseq_len <= last_i+1:
                    # Is this better?
                    this_val = table[last_i][subseq_len]
                    if prev_val is None or prev_val < this_val:
                        prev_val = this_val
            # Do we have a best to offer?
            if prev_val is not None:
                table[i].append(prev_val + values[i])

    # Now we look for the best entry of length l.
    best_val = None
    for row in table:
        # If the row has entries for 0...l will have len > l.
        if l < len(row):
            if best_val is None or best_val < row[l]:
                best_val = row[l]
    return best_val

print(max_subseq_sum(2, 3, [10, 1, 1, 1, 1, 10]))
print(max_subseq_sum(3, 4, [8, 3, 7, 6, 2, 1, 9, 2, 5, 4]))
If I wanted to be slightly clever I could make this memory O(n) pretty easily by calculating one layer at a time, throwing away the previous one. It takes a lot of cleverness to reduce running time to O(l*n*log(k)) but that is doable. (Use a priority queue for your best value in the last k. It is O(log(k)) to update it for each element but naturally grows. Every k values you throw it away and rebuild it for a O(k) cost incurred O(n/k) times for a total O(n) rebuild cost.)
And here is the clever version. Memory O(n). Time O(n*l*log(k)) worst case, and average case is O(n*l). You hit the worst case when it is sorted in ascending order.
import heapq

def max_subseq_sum(k, l, values):
    prev_best = [0 for _ in values]
    # i represents how many in prev subsequences
    # It ranges from 0..(l-1).
    for i in range(l):
        # We are building subsequences of length i+1.
        # We will have no way to find one that ends
        # before the i'th element at position i-1
        best = [None for _ in range(i)]
        # Our heap will be (-sum, index). It is a min_heap so the
        # minimum element has the largest sum. We track the index
        # so that we know when it is in the last k.
        min_heap = [(-prev_best[i-1], i-1)]
        for j in range(i, len(values)):
            # Remove best elements that are more than k back.
            while min_heap[0][-1] < j-k:
                heapq.heappop(min_heap)
            # We append this value + (best prev sum) using -(-..) = +.
            best.append(values[j] - min_heap[0][0])
            heapq.heappush(min_heap, (-prev_best[j], j))
            # And now keep min_heap from growing too big.
            if 2*k < len(min_heap):
                # Filter out elements too far back.
                min_heap = [_ for _ in min_heap if j - k < _[1]]
                # And make into a heap again.
                heapq.heapify(min_heap)
        # And now finish this layer.
        prev_best = best
    # The first l-1 entries are None (no subsequence of length l can end there),
    # and Python 3 cannot compare None with int, so skip them.
    return max(v for v in prev_best if v is not None)
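A different way to get the best value among the last k positions is a monotonic deque instead of a heap. This is a swapped-in technique, not what the answer above uses: each index enters and leaves the deque at most once per layer, so a layer costs O(n) and the whole run O(n*l) in the worst case. The sketch below keeps the same argument order; the name max_subseq_sum_deque is hypothetical, and it assumes k >= 1.

```python
from collections import deque


def max_subseq_sum_deque(k, l, values):
    """Sliding-window-maximum variant of the layered DP (illustrative sketch)."""
    n = len(values)
    if l < 1 or l > n:
        return None                      # no subsequence of that length exists
    prev = list(values)                  # best sums for subsequences of length 1
    for length in range(2, l + 1):
        cur = [None] * n                 # positions < length-1 cannot end such a subsequence
        window = deque()                 # indices into prev, with decreasing prev values
        for j in range(length - 1, n):
            cand = j - 1                 # newest admissible predecessor of j
            while window and prev[window[-1]] <= prev[cand]:
                window.pop()             # dominated by the newer, larger candidate
            window.append(cand)
            while window[0] < j - k:
                window.popleft()         # more than k positions behind j
            cur[j] = values[j] + prev[window[0]]
        prev = cur
    return max(v for v in prev if v is not None)


print(max_subseq_sum_deque(2, 3, [10, 1, 1, 1, 1, 10]))
print(max_subseq_sum_deque(3, 4, [8, 3, 7, 6, 2, 1, 9, 2, 5, 4]))
```

The front of the deque is always the index of the largest previous-layer sum within reach, which plays the role of the heap's top element.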
Extending the code for itertools.combinations shown at the docs, I built a version that includes an argument for the maximum index distance (K) between two values. It only needed an additional and indices[i] - indices[i-1] < K check in the iteration:
def combinations_with_max_dist(iterable, r, K):
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    pool = tuple(iterable)
    n = len(pool)
    if r > n:
        return
    indices = list(range(r))
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != i + n - r and indices[i] - indices[i-1] < K:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        yield tuple(pool[i] for i in indices)
Using this you can bruteforce over all combinations with regards to K, and then find the one that has the maximum value sum:
def find_subseq(a, L, K):
    return max((sum(values), values) for values in combinations_with_max_dist(a, L, K))
Results:
print(*find_subseq([10, 1, 1, 1, 1, 10], L=3, K=2))
# 12 (10, 1, 1)
print(*find_subseq([8, 3, 7, 6, 2, 1, 9, 2, 5, 4], L=4, K=3))
# 30 (8, 7, 6, 9)
Not sure about the performance if your value lists become very long though...
Algorithm
Basic idea:
Iteration on input array, choose each index as the first taken element.
Then Recursion on each first taken element, mark the index as firstIdx.
The next possible index would be in range [firstIdx + 1, firstIdx + K], both inclusive.
Loop on the range to call each index recursively, with L - 1 as the new L.
Optionally, for each pair of (firstIndex, L), cache its max sum, for reuse.
Maybe this is necessary for large input.
Constraints:
array length <= 1 << 17 // 131072
K <= 1 << 6 // 64
L <= 1 << 8 // 256
Complexity:
Time: O(n * L * K)
Since each (firstIdx, L) pair is calculated only once, and each calculation contains an iteration over K.
Space: O(n * L)
For cache, and method stack in recursive call.
Tips:
Depth of recursion is related to L, not array length.
The defined constraints are not the actual limit, it could be larger, though I didn't test how large it can be.
Basically:
Both array length and K actually could be of any size as long as there are enough memory, since they are handled via iteration.
L is handled via recursion, thus it does have a limit.
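The idea described above (recurse on the first taken index, cache each (firstIdx, L) pair) can be sketched in Python as follows. This is an illustrative translation under my own assumptions, not the Java code that follows: the names max_sum_take_first and take_first are hypothetical, functools.lru_cache stands in for the hand-written cache, and None is used instead of a negative sentinel for "not enough elements".

```python
from functools import lru_cache


def max_sum_take_first(arr, K, L):
    """Memoized recursion on (first taken index, elements still to take)."""
    n = len(arr)
    if L == 0:
        return 0

    @lru_cache(maxsize=None)
    def take_first(first_idx, remaining):
        # 'remaining' counts the elements still to pick, including arr[first_idx].
        if first_idx + remaining > n:
            return None                          # not enough elements left
        if remaining == 1:
            return arr[first_idx]
        best = None
        # Next taken index lies in [first_idx + 1, first_idx + K], clipped to the array.
        for nxt in range(first_idx + 1, min(first_idx + K, n - 1) + 1):
            rest = take_first(nxt, remaining - 1)
            if rest is not None and (best is None or rest > best):
                best = rest
        return None if best is None else arr[first_idx] + best

    results = [r for r in (take_first(i, L) for i in range(n)) if r is not None]
    return max(results) if results else None


print(max_sum_take_first([10, 1, 1, 1, 1, 10], 2, 3))
print(max_sum_take_first([8, 3, 7, 6, 2, 1, 9, 2, 5, 4], 3, 4))
```

As in the description, the recursion depth grows with L rather than with the array length, so Python's default recursion limit is only a concern for large L.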
Code - in Java
SubSumLimitedDistance.java:
import java.util.HashMap;
import java.util.Map;
public class SubSumLimitedDistance {
public static final long NOT_ENOUGH_ELE = -1; // sum that indicate not enough element, should be < 0,
public static final int MAX_ARR_LEN = 1 << 17; // max length of input array,
public static final int MAX_K = 1 << 6; // max K, should not be too long, otherwise slow,
public static final int MAX_L = 1 << 8; // max L, should not be too long, otherwise stackoverflow,
/**
 * Find max sum of sub array.
 *
 * @param arr
 * @param K
 * @param L
 * @return max sum,
 */
public static long find(int[] arr, int K, int L) {
if (K < 1 || K > MAX_K)
throw new IllegalArgumentException("K should be between [1, " + MAX_K + "], but get: " + K);
if (L < 0 || L > MAX_L)
throw new IllegalArgumentException("L should be between [0, " + MAX_L + "], but get: " + L);
if (arr.length > MAX_ARR_LEN)
throw new IllegalArgumentException("input array length should <= " + MAX_ARR_LEN + ", but get: " + arr.length);
Map<Integer, Map<Integer, Long>> cache = new HashMap<>(); // cache,
long maxSum = NOT_ENOUGH_ELE;
for (int i = 0; i < arr.length; i++) {
long sum = findTakeFirst(arr, K, L, i, cache);
if (sum == NOT_ENOUGH_ELE) break; // not enough elements,
if (sum > maxSum) maxSum = sum; // larger found,
}
return maxSum;
}
/**
 * Find max sum of sub array, with index of first taken element specified,
 *
 * @param arr
 * @param K
 * @param L
 * @param firstIdx index of first taken element,
 * @param cache
 * @return max sum,
 */
private static long findTakeFirst(int[] arr, int K, int L, int firstIdx, Map<Integer, Map<Integer, Long>> cache) {
// System.out.printf("findTakeFirst(): K = %d, L = %d, firstIdx = %d\n", K, L, firstIdx);
if (L == 0) return 0; // done,
if (firstIdx + L > arr.length) return NOT_ENOUGH_ELE; // not enough elements,
// check cache,
Map<Integer, Long> map = cache.get(firstIdx);
Long cachedResult;
if (map != null && (cachedResult = map.get(L)) != null) {
// System.out.printf("hit cache, cached result = %d\n", cachedResult);
return cachedResult;
}
// cache not exists, calculate,
long maxRemainSum = NOT_ENOUGH_ELE;
for (int i = firstIdx + 1; i <= firstIdx + K; i++) {
long remainSum = findTakeFirst(arr, K, L - 1, i, cache);
if (remainSum == NOT_ENOUGH_ELE) break; // not enough elements,
if (remainSum > maxRemainSum) maxRemainSum = remainSum;
}
if ((map = cache.get(firstIdx)) == null) cache.put(firstIdx, map = new HashMap<>());
if (maxRemainSum == NOT_ENOUGH_ELE) { // not enough elements,
map.put(L, NOT_ENOUGH_ELE); // cache - as not enough elements,
return NOT_ENOUGH_ELE;
}
long maxSum = arr[firstIdx] + maxRemainSum; // max sum,
map.put(L, maxSum); // cache - max sum,
return maxSum;
}
}
SubSumLimitedDistanceTest.java:
(test case, via TestNG)
import org.testng.Assert;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;
import java.util.concurrent.ThreadLocalRandom;
public class SubSumLimitedDistanceTest {
private int[] arr;
private int K;
private int L;
private int maxSum;
private int[] arr2;
private int K2;
private int L2;
private int maxSum2;
private int[] arrMax;
private int KMax;
private int KMaxLargest;
private int LMax;
private int LMaxLargest;
@BeforeClass
private void setUp() {
// init - arr,
arr = new int[]{10, 1, 1, 1, 1, 10};
K = 2;
L = 3;
maxSum = 12;
// init - arr2,
arr2 = new int[]{8, 3, 7, 6, 2, 1, 9, 2, 5, 4};
K2 = 3;
L2 = 4;
maxSum2 = 30;
// init - arrMax,
arrMax = new int[SubSumLimitedDistance.MAX_ARR_LEN];
ThreadLocalRandom rd = ThreadLocalRandom.current();
long maxLongEle = Long.MAX_VALUE / SubSumLimitedDistance.MAX_ARR_LEN;
int maxEle = maxLongEle > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) maxLongEle;
for (int i = 0; i < arrMax.length; i++) {
arrMax[i] = rd.nextInt(maxEle);
}
KMax = 5;
LMax = 10;
KMaxLargest = SubSumLimitedDistance.MAX_K;
LMaxLargest = SubSumLimitedDistance.MAX_L;
}
@Test
public void test() {
Assert.assertEquals(SubSumLimitedDistance.find(arr, K, L), maxSum);
Assert.assertEquals(SubSumLimitedDistance.find(arr2, K2, L2), maxSum2);
}
@Test(timeOut = 6000)
public void test_veryLargeArray() {
run_printDuring(arrMax, KMax, LMax);
}
@Test(timeOut = 60000) // takes seconds,
public void test_veryLargeArrayL() {
run_printDuring(arrMax, KMax, LMaxLargest);
}
@Test(timeOut = 60000) // takes seconds,
public void test_veryLargeArrayK() {
run_printDuring(arrMax, KMaxLargest, LMax);
}
// run find once, and print during,
private void run_printDuring(int[] arr, int K, int L) {
long startTime = System.currentTimeMillis();
long sum = SubSumLimitedDistance.find(arr, K, L);
long during = System.currentTimeMillis() - startTime; // during in milliseconds,
System.out.printf("arr length = %5d, K = %3d, L = %4d, max sum = %15d, running time = %.3f seconds\n", arr.length, K, L, sum, during / 1000.0);
}
@Test
public void test_corner_notEnoughEle() {
Assert.assertEquals(SubSumLimitedDistance.find(new int[]{1}, 2, 3), SubSumLimitedDistance.NOT_ENOUGH_ELE); // not enough element,
Assert.assertEquals(SubSumLimitedDistance.find(new int[]{0}, 1, 3), SubSumLimitedDistance.NOT_ENOUGH_ELE); // not enough element,
}
@Test
public void test_corner_ZeroL() {
Assert.assertEquals(SubSumLimitedDistance.find(new int[]{1, 2, 3}, 2, 0), 0); // L = 0,
Assert.assertEquals(SubSumLimitedDistance.find(new int[]{0}, 1, 0), 0); // L = 0,
}
@Test(expectedExceptions = IllegalArgumentException.class)
public void test_invalid_K() {
// SubSumLimitedDistance.find(new int[]{1, 2, 3}, 0, 2); // K = 0,
// SubSumLimitedDistance.find(new int[]{1, 2, 3}, -1, 2); // K = -1,
SubSumLimitedDistance.find(new int[]{1, 2, 3}, SubSumLimitedDistance.MAX_K + 1, 2); // K = SubSumLimitedDistance.MAX_K+1,
}
@Test(expectedExceptions = IllegalArgumentException.class)
public void test_invalid_L() {
// SubSumLimitedDistance.find(new int[]{1, 2, 3}, 2, -1); // L = -1,
SubSumLimitedDistance.find(new int[]{1, 2, 3}, 2, SubSumLimitedDistance.MAX_L + 1); // L = SubSumLimitedDistance.MAX_L+1,
}
@Test(expectedExceptions = IllegalArgumentException.class)
public void test_invalid_tooLong() {
SubSumLimitedDistance.find(new int[SubSumLimitedDistance.MAX_ARR_LEN + 1], 2, 3); // input array too long,
}
}
Output of test case for large input:
arr length = 131072, K = 5, L = 10, max sum = 20779205738, running time = 0.303 seconds
arr length = 131072, K = 64, L = 10, max sum = 21393422854, running time = 1.917 seconds
arr length = 131072, K = 5, L = 256, max sum = 461698553839, running time = 9.474 seconds
Related
HackerRank "filled orders" problem with Python
Recently HackerRank launched their own certifications. Among the tests they offer is "Problem Solving". The test contains 2 problems; they give you 90 minutes to solve them. Being inexperienced as I am, I failed, because it took me longer than that.
Specifically, I came up with the solution for the first problem (filled orders, see below) in about 30 minutes, and spent the rest of the time trying to debug it. The problem wasn't that the solution didn't work, but that it worked on only some of the test cases.
Out of 14 test cases the solution worked on 7 (including all the open ones and a bunch of closed ones), and didn't work on the remaining 7 (all closed). Closed means that neither the input data nor the expected output is available. (Which makes sense, because some of the lists there included 250K+ elements.)
But it drives me crazy; I can't figure out what might be wrong with it. I tried putting print statements all over the place, but the only thing I found is that 1 too many elements get added to the list; hence the last if statement (to drop the last added element), but it made no difference whatsoever, so it's probably wrong.
Here's the problem:
A widget manufacturer is facing unexpectedly high demand for its new product. They would like to satisfy as many customers as possible. Given a number of widgets available and a list of customer orders, what is the maximum number of orders the manufacturer can fulfill in full?
Function Description
Complete the function filledOrders in the editor below. The function must return a single integer denoting the maximum possible number of fulfilled orders.
filledOrders has the following parameter(s):
order : an array of integers listing the orders
k : an integer denoting widgets available for shipment
Constraints
1 ≤ n ≤ 2 × 10^5
1 ≤ order[i] ≤ 10^9
1 ≤ k ≤ 10^9
Sample Input For Custom Testing
2
10
30
40
Sample Output
2
And here's my function:
def filledOrders(order, k):
    total = k
    fulf = []
    for r in order:
        if r <= total:
            fulf.append(r)
            total -= r
        else:
            break
    if sum(fulf) > k:
        fulf.pop()
    return len(fulf)
Java Solution
int count = 0;
Collections.sort(order);
for (int i = 0; i < order.size(); i++) {
    if (order.get(i) <= k) {
        count++;
        k = k - order.get(i);
    }
}
return count;
Code Revision
def filledOrders(order, k):
    total = 0
    for i, v in enumerate(sorted(order)):
        if total + v <= k:
            total += v       # total stays <= k
        else:
            return i         # provides the count
    else:
        return len(order)    # was able to place all orders

print(filledOrders([3, 2, 1], 3))   # Out: 2
print(filledOrders([3, 2, 1], 1))   # Out: 1
print(filledOrders([3, 2, 1], 10))  # Out: 3
print(filledOrders([3, 2, 1], 0))   # Out: 0
Advanced Javascript solution:
function filledOrders(order, k) {
  // Write your code here
  let count = 0;
  let total = 0;
  const ordersLength = order.length;
  const sortedOrders = order.sort(function(a, b) {
    return (+a) - (+b);
  });
  for (let i = 0; i < ordersLength; i++) {
    if (total + sortedOrders[i] <= k) {
      // if all orders able to be filled
      if (total <= k && i === ordersLength - 1) return ordersLength;
      total += sortedOrders[i];
      count++;
    } else {
      return count;
    }
  }
}
Python code
def filledOrders(order, k):
    orderfulfilled = 0
    # Fill the smallest orders first, starting from the first element
    # so that no order is skipped.
    for item in sorted(order):
        m = k - item
        if m >= 0:
            orderfulfilled += 1
            k -= item
    return orderfulfilled
Javascript solution
Option 1:
function filledOrders(order, k) {
  // Sort numerically; the default sort() compares elements as strings.
  const arr = order.sort((a, b) => a - b).filter((item) => {
    if (item <= k) {
      k = k - item;
      return true;
    }
    return false;
  });
  return arr.length;
}
Option 2:
function filledOrders(order, k) {
  let count = 0;
  // Sort once, numerically, before the loop.
  order.sort((a, b) => a - b);
  for (let i = 0; i < order.length; i++) {
    if (order[i] <= k) {
      count++;
      k = k - order[i];
    }
  }
  return count;
}
C#
using System.CodeDom.Compiler;
using System.Collections.Generic;
using System.Collections;
using System.ComponentModel;
using System.Diagnostics.CodeAnalysis;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Runtime.Serialization;
using System.Text.RegularExpressions;
using System.Text;
using System;
using System.Reflection.Metadata.Ecma335;

class Result
{
    /*
     * Complete the 'filledOrders' function below.
     *
     * The function is expected to return an INTEGER.
     * The function accepts following parameters:
     *  1. INTEGER_ARRAY order
     *  2. INTEGER k
     */
    public static int filledOrders(List<int> order, int k)
    {
        if (order.Sum() <= k)
        {
            return order.Count();
        }
        else
        {
            int counter = 0;
            foreach (int element in order)
            {
                if (element <= k)
                {
                    counter++;
                    k = k - element;
                }
            }
            return counter;
        }
    }
}

class Solution
{
    public static void Main(string[] args)
    {
        int orderCount = Convert.ToInt32(Console.ReadLine().Trim());
        List<int> order = new List<int>();
        for (int i = 0; i < orderCount; i++)
        {
            int orderItem = Convert.ToInt32(Console.ReadLine().Trim());
            order.Add(orderItem);
        }
        int k = Convert.ToInt32(Console.ReadLine().Trim());
        var orderedList = order.OrderBy(a => a).ToList();
        int result = Result.filledOrders(orderedList, k);
        Console.WriteLine(result);
    }
}
I think the better approach (to decrease time complexity) is to solve this without sorting. (Of course, that comes at the cost of readability.) Below is a solution that does not use sort. (Not sure if I covered all edge cases.)
import os, sys

def max_fulfilled_orders(order_arr, k):
    # track the max no.of orders in the arr.
    max_num = 0
    # order count, can be fulfilled.
    order_count = 0
    # iter over order array
    for i in range(0, len(order_arr)):
        # if remain value < 0 then
        if k - order_arr[i] < 0:
            # add the max no.of orders to total
            k += max_num
            if order_count > 0:
                # decrease order_count
                order_count -= 1
        # if the remain value >= 0
        if k - order_arr[i] >= 0:
            # subtract the current no.of orders from total.
            k -= order_arr[i]
            # increase the order count.
            order_count += 1
            # track the max no.of orders till the point.
            if order_arr[i] > max_num:
                max_num = order_arr[i]
    return order_count

print(max_fulfilled_orders([3, 2, 1], 0))      # Out: 0
print(max_fulfilled_orders([3, 2, 1], 1))      # Out: 1
print(max_fulfilled_orders([3, 1, 1], 2))      # Out: 2
print(max_fulfilled_orders([3, 2, 4], 9))      # Out: 3
print(max_fulfilled_orders([3, 2, 1, 4], 10))  # Out: 4
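One caveat on the no-sort idea, as far as I can trace it: remembering only a single maximum lets the same large order be "given back" more than once. For example, with orders [5, 1, 1, 4, 4] and k = 5, the 5 appears to be refunded twice and the result comes out as 3, although only two orders (1 and 1) fit. A hedged sketch of the same "drop the largest accepted order" idea that tracks all accepted orders in a max-heap is shown below. The name filled_orders_heap is hypothetical, and this still costs O(n log n) in the worst case, so it is not faster than sorting; it only avoids the explicit sort.

```python
import heapq


def filled_orders_heap(order, k):
    """Greedy with a max-heap of accepted orders (illustrative sketch)."""
    accepted = []          # max-heap emulated by storing negated order sizes
    remaining = k
    for size in order:
        heapq.heappush(accepted, -size)
        remaining -= size
        if remaining < 0:
            # Over budget: drop the largest accepted order (possibly this one).
            # One pop is enough because the dropped order is at least 'size'.
            remaining += -heapq.heappop(accepted)
    return len(accepted)


print(filled_orders_heap([3, 2, 1], 3))
print(filled_orders_heap([5, 1, 1, 4, 4], 5))
```

Because every accepted order stays in the heap until it is evicted, an order can never be refunded twice.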
In Python:
def order_fillers(order, k):
    if len(order) == 0 or k == 0:
        return 0
    order.sort()
    max_orders = 0
    for item in order:
        if k <= 0:
            return max_orders
        if item <= k:
            max_orders += 1
            k -= item
    return max_orders
JavaScript Solution
function filledOrders(order, k) {
  let total = 0;
  let count = 0;
  const ordersLength = order.length;
  // Sort numerically; the default sort() compares elements as strings.
  const sortedOrders = order.sort((a, b) => a - b);
  for (let i = 0; i < ordersLength; i++) {
    if (total + sortedOrders[i] <= k) {
      // if all orders able to be filled
      if (total <= k && i === ordersLength - 1) return ordersLength;
      total += sortedOrders[i];
      count++;
    } else {
      return count;
    }
  }
}
// Validation
console.log(filledOrders([3, 2, 1], 3));  // 2
console.log(filledOrders([3, 2, 1], 1));  // 1
console.log(filledOrders([3, 2, 1], 10)); // 3
console.log(filledOrders([3, 2, 1], 0));  // 0
console.log(filledOrders([3, 2, 2], 1));  // 0
number of subsequences whose sum is divisible by k
I just did a coding challenge for a company and was unable to solve this problem. The problem statement goes like this:
Given an array of integers, find the number of subsequences in the array whose sum is divisible by k, where k is some positive integer.
For example, for [4, 1, 3, 2] and k = 3, the solution is 5. [[3], [1, 2], [4,3,2], [4,2], [1,3,2]] are the subsequences whose sum is divisible by k, i.e. current_sum + nums[i] % k == 0, where nums[i] is the current element in the array.
I tried to solve this recursively; however, I was unable to pass any test cases. My recursive code followed something like this:
def kSum(nums, k):
    def kSum(cur_sum, i):
        if i == len(nums): return 0
        sol = 1 if (cur_sum + nums[i]) % k == 0 else 0
        return sol + kSum(cur_sum, i+1) + kSum(cur_sum + nums[i], i+1)
    return kSum(0, 0)
What is wrong with this recursive approach, and how can I correct it? I'm not interested in an iterative solution, I just want to know why this recursive solution is wrong and how I can correct it.
Are you sure the problem is not with the test cases? For example, [4, 1, 3, 2] with k = 3 has:
4+2 = 6, 1+2 = 3, 3, 1+2+3 = 6, 4+2+3 = 9
So your function is right (it gives me 5) and I don't see a major problem with it.
Here is a JavaScript reproduction of what you wrote, with some console logs to help explain its behavior.
function kSum(nums, k) {
  let recursive_depth = 1;
  function _kSum(cur_sum, i) {
    recursive_depth++;
    if (i == nums.length) {
      recursive_depth--;
      return 0;
    }
    let sol = 0;
    if (((cur_sum + nums[i]) % k) === 0) {
      sol = 1;
      console.log(`Found valid sequence ending with ${nums[i]} with sum = ${cur_sum + nums[i]} with partial sum ${cur_sum} at depth ${recursive_depth}`);
    }
    const _kSum1 = _kSum(cur_sum, i+1);
    const _kSum2 = _kSum(cur_sum + nums[i], i+1);
    const res = sol + _kSum1 + _kSum2;
    recursive_depth--;
    return res;
  }
  return _kSum(0, 0);
}
let arr = [4, 1, 3, 2], k = 3;
console.log(kSum(arr, k));
I think this code actually gets the right answer. I'm not fluent in Python, but I might have inadvertently fixed a bug in your code by adding parentheses around (cur_sum + nums[i]) % k
It seems to me that your solution is correct. It reaches the answer by trying all subsequences, which has 2^n complexity. We could formulate it recursively in an O(n*k) search space, although it could be more efficient to table. Let f(A, k, i, r) represent how many subsequences leave remainder r when their sum is divided by k, using elements up to A[i]. Then:
function f(A, k, i=A.length-1, r=0){
  // A[i] leaves remainder r
  // when divided by k
  const c = A[i] % k == r ? 1 : 0;
  if (i == 0)
    return c;
  return c +
    // All previous subsequences
    // whose sum leaves remainder r
    // when divided by k
    f(A, k, i - 1, r) +
    // All previous subsequences whose
    // sum when combined with A[i]
    // leaves remainder r when
    // divided by k
    f(A, k, i - 1, (k + r - A[i]%k) % k);
}
console.log(f([1,2,1], 3));
console.log(f([2,3,5,8], 5));
console.log(f([4,1,3,2], 3));
console.log(f([3,3,3], 3));
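The tabled variant mentioned above could look roughly like the following Python sketch. It is an iterative illustration (so not what the asker wanted, only a companion to this answer), the name count_divisible_subsequences is my own, and it assumes k >= 1. It keeps one counter per remainder and runs in O(n*k) time with O(k) memory.

```python
def count_divisible_subsequences(nums, k):
    """Count non-empty subsequences of nums whose sum is divisible by k."""
    # counts[r] = number of non-empty subsequences seen so far with sum % k == r
    counts = [0] * k
    for x in nums:
        updated = counts[:]                  # subsequences that skip x
        for r, c in enumerate(counts):
            updated[(r + x) % k] += c        # subsequences extended by x
        updated[x % k] += 1                  # the subsequence consisting of x alone
        counts = updated
    return counts[0]


print(count_divisible_subsequences([4, 1, 3, 2], 3))
```

Each element is processed once against all k remainders, which is where the O(n*k) bound comes from.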
Python: Finding random k-subset partition for a given list
The following code generates all partitions of length k (k-subset partitions) for a given list. The algorithm can be found in this topic.
def algorithm_u(ns, m):
    def visit(n, a):
        ps = [[] for i in range(m)]
        for j in range(n):
            ps[a[j + 1]].append(ns[j])
        return ps

    def f(mu, nu, sigma, n, a):
        if mu == 2:
            yield visit(n, a)
        else:
            for v in f(mu - 1, nu - 1, (mu + sigma) % 2, n, a):
                yield v
        if nu == mu + 1:
            a[mu] = mu - 1
            yield visit(n, a)
            while a[nu] > 0:
                a[nu] = a[nu] - 1
                yield visit(n, a)
        elif nu > mu + 1:
            if (mu + sigma) % 2 == 1:
                a[nu - 1] = mu - 1
            else:
                a[mu] = mu - 1
            if (a[nu] + sigma) % 2 == 1:
                for v in b(mu, nu - 1, 0, n, a):
                    yield v
            else:
                for v in f(mu, nu - 1, 0, n, a):
                    yield v
            while a[nu] > 0:
                a[nu] = a[nu] - 1
                if (a[nu] + sigma) % 2 == 1:
                    for v in b(mu, nu - 1, 0, n, a):
                        yield v
                else:
                    for v in f(mu, nu - 1, 0, n, a):
                        yield v

    def b(mu, nu, sigma, n, a):
        if nu == mu + 1:
            while a[nu] < mu - 1:
                yield visit(n, a)
                a[nu] = a[nu] + 1
            yield visit(n, a)
            a[mu] = 0
        elif nu > mu + 1:
            if (a[nu] + sigma) % 2 == 1:
                for v in f(mu, nu - 1, 0, n, a):
                    yield v
            else:
                for v in b(mu, nu - 1, 0, n, a):
                    yield v
            while a[nu] < mu - 1:
                a[nu] = a[nu] + 1
                if (a[nu] + sigma) % 2 == 1:
                    for v in f(mu, nu - 1, 0, n, a):
                        yield v
                else:
                    for v in b(mu, nu - 1, 0, n, a):
                        yield v
            if (mu + sigma) % 2 == 1:
                a[nu - 1] = 0
            else:
                a[mu] = 0
        if mu == 2:
            yield visit(n, a)
        else:
            for v in b(mu - 1, nu - 1, (mu + sigma) % 2, n, a):
                yield v

    n = len(ns)
    a = [0] * (n + 1)
    for j in range(1, m + 1):
        a[n - m + j] = j - 1
    return f(m, n, 0, n, a)
We know that the number of k-subset partitions of a given list is given by the Stirling number of the second kind, and it can be very big for large lists. The code above returns a Python generator that can generate all possible k-subset partitions of the given list by calling next on it.
Accordingly, if I want to get only one of these partitions at random, I have to either call next a random number of times (which is really slow if the Stirling number is big) or use itertools.islice to get a slice of size one, which is just as slow.
I'm trying to avoid listing all partitions, because it would be a waste of time and memory (the calculations are heavy and memory is important in my case).
The question is: how can I generate only one of the k-subset partitions without generating the rest? Or at least make the procedure much faster than what is explained above. I need the performance because I need to get only one of them each time, and I'm running the application maybe more than ten million times.
I'd appreciate any help.
EDIT: EXAMPLE
list : { 1, 2, 3 }
for k = 3:
{ {1}, {2}, {3} }
for k = 2:
{ {1, 2}, {3} }
{ {1, 3}, {2} }
{ {1}, {2, 3} }
and for k = 1:
{ {1, 2, 3} }
Consider k = 2: is there any way I can generate only one of these 3 partitions at random, without generating the other 2? Note that I want to generate a random partition for a given k, not a random partition for any k; that is, if I set k to 2, I would like to generate only one of these 3, not one of all 5.
Regards, Mohammad
You can count Stirling numbers efficiently with a recursive algorithm by storing previously computed values:
fact = [1]
def nCr(n, k):
    """Return number of ways of choosing k elements from n"""
    while len(fact) <= n:
        fact.append(fact[-1] * len(fact))
    return fact[n] // (fact[k] * fact[n-k])

cache = {}
def count_part(n, k):
    """Return number of ways of partitioning n items into k non-empty subsets"""
    if k == 1:
        return 1
    key = n, k
    if key in cache:
        return cache[key]
    # The first element goes into the next partition
    # We can have up to y additional elements from the n-1 remaining
    # There will be n-1-y left over to partition into k-1 non-empty subsets
    # so n-1-y>=k-1
    # y<=n-k
    t = 0
    for y in range(0, n-k+1):
        t += count_part(n-1-y, k-1) * nCr(n-1, y)
    cache[key] = t
    return t
Once you know how many choices there are, you can adapt this recursive code to generate a particular partition:
def ith_subset(A, k, i):
    """Return ith k-subset of A"""
    # Choose first element x
    n = len(A)
    if n == k:
        return A
    if k == 0:
        return []
    for x in range(n):
        # Find how many cases are possible with the first element being x
        # There will be n-x-1 left over, from which we choose k-1
        extra = nCr(n-x-1, k-1)
        if i < extra:
            break
        i -= extra
    return [A[x]] + ith_subset(A[x+1:], k-1, i)

def gen_part(A, k, i):
    """Return i^th k-partition of elements in A (zero-indexed) as list of lists"""
    if k == 1:
        return [A]
    n = len(A)
    # First find appropriate value for y - the extra amount in this subset
    for y in range(0, n-k+1):
        extra = count_part(n-1-y, k-1) * nCr(n-1, y)
        if i < extra:
            break
        i -= extra
    # We count through the subsets, and for each subset we count through the partitions
    # Split i into a count for subsets and a count for the remaining partitions
    count_partition, count_subset = divmod(i, nCr(n-1, y))
    # Now find the i^th appropriate subset
    subset = [A[0]] + ith_subset(A[1:], y, count_subset)
    S = set(subset)
    return [subset] + gen_part([a for a in A if a not in S], k-1, count_partition)
As an example, I've written a test program that produces different partitions of 4 numbers:
def test(A):
    n = len(A)
    for k in [1, 2, 3, 4]:
        t = count_part(n, k)
        print(k, t)
        for i in range(t):
            print(" ", i, gen_part(A, k, i))

test([1, 2, 3, 4])
This code prints:
1 1
  0 [[1, 2, 3, 4]]
2 7
  0 [[1], [2, 3, 4]]
  1 [[1, 2], [3, 4]]
  2 [[1, 3], [2, 4]]
  3 [[1, 4], [2, 3]]
  4 [[1, 2, 3], [4]]
  5 [[1, 2, 4], [3]]
  6 [[1, 3, 4], [2]]
3 6
  0 [[1], [2], [3, 4]]
  1 [[1], [2, 3], [4]]
  2 [[1], [2, 4], [3]]
  3 [[1, 2], [3], [4]]
  4 [[1, 3], [2], [4]]
  5 [[1, 4], [2], [3]]
4 1
  0 [[1], [2], [3], [4]]
As another example, there are about 10 million partitions of 1,2,3,...,14 into 4 parts. This code can generate all partitions in 44 seconds with pypy.
There are 50,369,882,873,307,917,364,901 partitions of 1,2,3,...,40 into 4 parts. This code can generate 10 million of these in 120 seconds with pypy running on a single processor.
To tie things together, you can use this code to generate a single random partition of a list A into k non-empty subsets:
import random
def random_ksubset(A, k):
    i = random.randrange(0, count_part(len(A), k))
    return gen_part(A, k, i)
tl;dr: The k-subset partitions for a given n and k can be divided into types, based on which elements are the first to go into as yet empty parts. Each of these types is represented by a bit pattern with n-1 bits of which k-1 are set. While the number of partitions is huge (given by the second Stirling number), the number of types is much smaller, e.g.: n = 21, k = 8 number of partitions: S(21,8) = 132,511,015,347,084 number of types: (n-1 choose k-1) = 77,520 Calculating how many partitions there are of each type is simple, based on the position of the zeros in the bit pattern. If you make a list of all the types (by iterating over all n:k bit patterns) and keep a running total of the number of partitions, you can then use a binary search on this list to find the type of the partition with a given rank (in Log2(n-1 choose k-1) steps; 17 for the example above), and then translate the bit pattern into a partition and calculate into which part each element goes (in n steps). Every part of this method can be done iteratively, requiring no recursion. Here's a non-recursive solution. I've tried to roll my own, but it may (partially) overlap with Peter's answer or existing methods. If you have a set of n elements, e.g. with n=8: {a,b,c,d,e,f,g,h} then k-subset partitions will take this shape, e.g. with k=5: {a,e} {b,c,h} {d} {f} {g} This partition can also be written as: 1,2,2,3,1,4,5,2 which lists which part each of the elements goes in. So this sequence of n digits with values from 1 to k represents a k-subset partition of n elements. However, not all such sequences are valid partitions; every digit from 1 to k must be present, otherwise there would be empty parts: 1,2,2,3,1,3,5,2 → {a,e} {b,c,h} {d,f} {} {g} Also, to avoid duplicates, digit x can only be used after digit x-1 has been used. So the first digit is always 1, the second can be at most 2, and so on. 
If in the example we use digits 4 and 5 before 3, we get duplicate partitions:
1,2,2,3,1,4,5,2 → {a,e} {b,c,h} {d} {f} {g}
1,2,2,4,1,5,3,2 → {a,e} {b,c,h} {g} {d} {f}
When you group the partitions based on when each part is first used, you get these types:
type                         pattern  options   product  count
1,1,1,1,2,3,4,5              0001111  11111111  1        1
1,1,1,2,12,3,4,5             0010111  11112111  2        2
1,1,1,2,3,123,4,5            0011011  11111311  3        3
1,1,1,2,3,4,1234,5           0011101  11111141  4        4
1,1,1,2,3,4,5,12345          0011110  11111115  5        5
1,1,2,12,12,3,4,5            0100111  11122111  2*2      4
1,1,2,12,3,123,4,5           0101011  11121311  2*3      6
1,1,2,12,3,4,1234,5          0101101  11121141  2*4      8
1,1,2,12,3,4,5,12345         0101110  11121115  2*5      10
1,1,2,3,123,123,4,5          0110011  11113311  3*3      9
1,1,2,3,123,4,1234,5         0110101  11113141  3*4      12
1,1,2,3,123,4,5,12345        0110110  11113115  3*5      15
1,1,2,3,4,1234,1234,5        0111001  11111441  4*4      16
1,1,2,3,4,1234,5,12345       0111010  11111415  4*5      20
1,1,2,3,4,5,12345,12345      0111100  11111155  5*5      25
1,2,12,12,12,3,4,5           1000111  11222111  2*2*2    8
1,2,12,12,3,123,4,5          1001011  11221311  2*2*3    12
1,2,12,12,3,4,1234,5         1001101  11221141  2*2*4    16
1,2,12,12,3,4,5,12345        1001110  11221115  2*2*5    20
1,2,12,3,123,123,4,5         1010011  11213311  2*3*3    18
1,2,12,3,123,4,1234,5        1010101  11213141  2*3*4    24
1,2,12,3,123,4,5,12345       1010110  11213115  2*3*5    30
1,2,12,3,4,1234,1234,5       1011001  11211441  2*4*4    32
1,2,12,3,4,1234,5,12345      1011010  11211415  2*4*5    40
1,2,12,3,4,5,12345,12345     1011100  11211155  2*5*5    50
1,2,3,123,123,123,4,5        1100011  11133311  3*3*3    27
1,2,3,123,123,4,1234,5       1100101  11133141  3*3*4    36
1,2,3,123,123,4,5,12345      1100110  11133115  3*3*5    45
1,2,3,123,4,1234,1234,5      1101001  11131441  3*4*4    48
1,2,3,123,4,1234,5,12345     1101010  11131415  3*4*5    60
1,2,3,123,4,5,12345,12345    1101100  11131155  3*5*5    75
1,2,3,4,1234,1234,1234,5     1110001  11114441  4*4*4    64
1,2,3,4,1234,1234,5,12345    1110010  11114415  4*4*5    80
1,2,3,4,1234,5,12345,12345   1110100  11114155  4*5*5    100
1,2,3,4,5,12345,12345,12345  1111000  11111555  5*5*5    125
SUM = 1050
In the above diagram, a partition of the type:
1,2,12,3,123,4,1234,5
means that:
a goes into part 1
b goes into part 2
c goes into part 1 or 2
d goes into part 3
e goes into part 1, 2 or 3
f goes into part 4
g goes into part 1, 2, 3 or 4
h goes into part 5
So partitions of this type have a digit that can have 2 values, a digit that can have 3 values, and a digit that can have 4 values (this is indicated in the third column in the diagram above). So there are a total of 2 × 3 × 4 partitions of this type (as indicated in columns 4 and 5). The sum of these is of course the Stirling number: S(8,5) = 1050.
The second column in the diagram is another way of notating the type of the partition: after starting with 1, every digit is either a digit that has been used before, or a step up (i.e. the highest digit used so far + 1). If we represent these two options by 0 and 1, we get e.g.:
1,2,12,3,123,4,1234,5 → 1010101
where 1010101 means:
Start with 1
1 → step up to 2
0 → repeat 1 or 2
1 → step up to 3
0 → repeat 1, 2 or 3
1 → step up to 4
0 → repeat 1, 2, 3 or 4
1 → step up to 5
So every binary sequence with n-1 digits and k-1 ones represents a type of partition. We can calculate the number of partitions of a type by iterating over the digits from left to right, incrementing a factor when we find a one, and multiplying with the factor when we find a zero, e.g.:
1,2,12,3,123,4,1234,5 → 1010101
Start with product = 1, factor = 1
1 → increment factor: 2
0 → product × factor = 2
1 → increment factor: 3
0 → product × factor = 6
1 → increment factor: 4
0 → product × factor = 24
1 → increment factor: 5
And again for this example, we find that there are 24 partitions of this type. So, counting the partitions of each type can be done by iterating over all n-1-digit integers with k-1 digits set, using any method (e.g.
Gosper's Hack):
pattern  count  running total
0001111  1      1
0010111  2      3
0011011  3      6
0011101  4      10
0011110  5      15
0100111  4      19
0101011  6      25
0101101  8      33
0101110  10     43
0110011  9      52
0110101  12     64
0110110  15     79
0111001  16     95
0111010  20     115
0111100  25     140
1000111  8      148
1001011  12     160
1001101  16     176
1001110  20     196
1010011  18     214
1010101  24     238
1010110  30     268
1011001  32     300
1011010  40     340
1011100  50     390
1100011  27     417
1100101  36     453
1100110  45     498
1101001  48     546
1101010  60     606
1101100  75     681
1110001  64     745
1110010  80     825
1110100  100    925
1111000  125    1050
Finding a random partition then means choosing a number from 1 to S(n,k), going over the counts per partition type while keeping a running total (column 3 above), and picking the corresponding partition type, and then calculating the value of the repeated digits, e.g.:
S(8,5) = 1050
random pick: e.g. 333
type: 1011010 → 1,2,12,3,4,1234,5,12345
range: 301 - 340
variation: 333 - 301 = 32
digit options: 2, 4, 5
digit values: 20, 5, 1
variation: 32 = 1 × 20 + 2 × 5 + 2 × 1
digits: 1, 2, 2 (0-based) → 2, 3, 3 (1-based)
partition: 1,2,2,3,4,3,5,3
and the 333rd partition of 8 elements in 5 parts is:
1,2,2,3,4,3,5,3 → {a} {b,c} {d,f,h} {e} {g}
There are a number of options to turn this into code; if you store the n-1-digit numbers as a running total, you can do subsequent lookups using a binary search over the list, whose length is C(n-1,k-1), to reduce time complexity from O(C(n-1,k-1)) to O(Log2(C(n-1,k-1))).
I've made a first test in JavaScript (sorry, I don't speak Python); it's not pretty but it demonstrates the method and is quite fast. The example is for the case n=21 and k=8; it creates the count table for 77,520 types of partitions, returns the total number of partitions 132,511,015,347,084 and then retrieves 10 randomly picked partitions within that range. On my computer this code returns a million randomly selected partitions in 3.7 seconds.
(note: the code is zero-based, unlike the explanation above)
function kSubsetPartitions(n, k) { // Constructor
    this.types = [];
    this.count = [];
    this.total = 0;
    this.elems = n;
    var bits = (1 << k - 1) - 1, done = 1 << n - 1;
    do {
        this.total += variations(bits);
        this.types.push(bits);
        this.count.push(this.total);
    } while (!((bits = next(bits)) & done));

    function variations(bits) {
        var product = 1, factor = 1, mask = 1 << n - 2;
        while (mask) {
            if (bits & mask) ++factor;
            else product *= factor;
            mask >>= 1;
        }
        return product;
    }
    function next(a) { // Gosper's Hack
        var c = (a & -a), r = a + c;
        return (((r ^ a) >> 2) / c) | r;
    }
}
kSubsetPartitions.prototype.partition = function(rank) {
    var range = 1, type = binarySearch(this.count, rank);
    if (type) {
        rank -= this.count[type - 1];
        range = this.count[type] - this.count[type - 1];
    }
    return translate(this.types[type], this.elems, range, rank);

    // This translates the bit pattern format and creates the correct partition
    // for the given rank, using a letter format for demonstration purposes
    function translate(bits, len, range, rank) {
        var partition = [["A"]], part, max = 0, mask = 1 << len - 2;
        for (var i = 1; i < len; i++, mask >>= 1) {
            if (!(bits & mask)) {
                range /= (max + 1);
                part = Math.floor(rank / range);
                rank %= range;
            }
            else part = ++max;
            if (!partition[part]) partition[part] = "";
            partition[part] += String.fromCharCode(65 + i);
        }
        return partition.join(" / ");
    }
    function binarySearch(array, value) {
        var low = 0, mid, high = array.length - 1;
        while (high - low > 1) {
            mid = Math.ceil((high + low) / 2);
            if (value < array[mid]) high = mid;
            else low = mid;
        }
        return value < array[low] ? low : high;
    }
}
var ksp = new kSubsetPartitions(21, 8);
document.write("Number of k-subset partitions for n,k = 21,8 → " + ksp.total.toLocaleString("en-US") + "<br>");
for (var tests = 10; tests; tests--) {
    var rnd = Math.floor(Math.random() * ksp.total);
    document.write("Partition " + rnd.toLocaleString("en-US", {minimumIntegerDigits: 15}) + " → " + ksp.partition(rnd) + "<br>");
}
It isn't really necessary to store the bit patterns for each partition type, because they can be recreated from their index (see e.g. the second algorithm in this answer). If you only store the running total of the number of variations per partition type, that halves the memory requirement.
This second code example in C++ stores only the counts, and returns the partition as an n-length array containing the part number for each element. Usage example at the end of the code. On my computer it creates the count list for n=40 and k=32 in 12 seconds and then returns 10 million partitions in 24 seconds.
Values of n can go up to 65 and k up to 64, but for some combinations the number of partitions will be greater than 2^64, which this code obviously can't handle. If you translate it into Python, there should be no such restrictions. (Note: enable zero check in binomial coefficient function if k=1.)
#include <cstdint>
#include <vector>

class kSubsetPartitions {
    std::vector <uint64_t> count;
    uint64_t total;
    uint8_t n;
    uint8_t k;

public:
    kSubsetPartitions(uint8_t n, uint8_t k) {
        this->total = 0;
        this->n = n;
        this->k = k;
        uint64_t bits = ((uint64_t) 1 << k - 1) - 1;
        uint64_t types = choose(n - 1, k - 1);
        this->count.reserve(types);
        while (types--) {
            this->total += variations(bits);
            this->count.push_back(this->total);
            bits = next(bits);
        }
    }
    uint64_t range() {
        return this->total;
    }
    void partition(uint64_t rank, uint8_t *buffer) {
        uint64_t range = 1;
        uint64_t type = binarySearch(rank);
        if (type) {
            rank -= this->count[type - 1];
            range = this->count[type] - this->count[type - 1];
        }
        format(pattern(type), range, rank, buffer);
    }

private:
    // Recreate the bit pattern of a partition type from its index
    uint64_t pattern(uint64_t type) {
        uint64_t steps, bits = 0, mask = (uint64_t) 1 << this->n - 2;
        uint8_t ones = this->k - 1;
        for (uint8_t i = this->n - 1; i; i--, mask >>= 1) {
            if (i > ones) {
                steps = choose(i - 1, ones);
                if (type >= steps) {
                    type -= steps;
                    bits |= mask;
                    --ones;
                }
            }
            else bits |= mask;
        }
        return bits;
    }
    uint64_t choose(uint8_t x, uint8_t y) { // C(x,y) using Pascal's Triangle
        static std::vector <std::vector <uint64_t> > triangle;
        if (triangle.empty()) {
            triangle.resize(this->n);
            triangle[0].push_back(1);
            for (uint8_t i = 1; i < this->n; i++) {
                triangle[i].push_back(1);
                for (uint8_t j = 1; j < i; j++) {
                    triangle[i].push_back(triangle[i - 1][j - 1] + triangle[i - 1][j]);
                }
                triangle[i].push_back(1);
            }
        }
        return triangle[x][y];
    }
    void format(uint64_t bits, uint64_t range, uint64_t rank, uint8_t *buffer) {
        uint64_t mask = (uint64_t) 1 << this->n - 2;
        uint8_t max = 0, part;
        *buffer = 0;
        while (mask) {
            if (!(bits & mask)) {
                range /= (max + 1);
                part = rank / range;
                rank %= range;
            }
            else part = ++max;
            *(++buffer) = part;
            mask >>= 1;
        }
    }
    uint64_t binarySearch(uint64_t rank) {
        uint64_t low = 0, mid, high = this->count.size() - 1;
        while (high - low > 1) {
            mid = (high + low + 1) / 2;
            if (rank < this->count[mid]) high = mid;
            else low = mid;
        }
        return rank < this->count[low] ? low : high;
    }
    uint64_t variations(uint64_t bits) {
        uint64_t product = 1;
        uint64_t mask = (uint64_t) 1 << this->n - 2;
        uint8_t factor = 1;
        while (mask) {
            if (bits & mask) ++factor;
            else product *= factor;
            mask >>= 1;
        }
        return product;
    }
    uint64_t next(uint64_t a) { // Gosper's Hack
        // if (!a) return a; // k=1 => a=0 => c=0 => division by zero!
        uint64_t c = (a & -a), r = a + c;
        return (((r ^ a) >> 2) / c) | r;
    }
};

// USAGE EXAMPLE:
// uint8_t buffer[40];
// kSubsetPartitions* ksp = new kSubsetPartitions(40, 32);
// uint64_t range = ksp->range();
// ksp->partition(any_integer_below_range, buffer);
Below is an overview of the values of n and k that result in more than 2^64 partitions, and cause overflow in the code above. Up to n=26, all values of k give valid results.
n: k
25: -
26: -
27: 8-13
28: 7-15
29: 6-17
30: 6-18
31: 5-20
32: 5-21
33: 5-22
34: 4-23
35: 4-25
36: 4-26
37: 4-27
38: 4-28
39: 4-29
40: 4-31
41: 4-32
42: 4-33
43: 3-34
44: 3-35
45: 3-36
46: 3-37
47: 3-38
48: 3-39
49: 3-40
50: 3-42
51: 3-43
52: 3-44
53: 3-45
54: 3-46
55: 3-47
56: 3-48
57: 3-49
58: 3-50
59: 3-51
60: 3-52
61: 3-53
62: 3-54
63: 3-55
64: 3-56
65: 3-57
A version which doesn't store the number of partitions per type is possible, and would require almost no memory. Looking up the partitions that correspond to randomly selected integers would be slower, but if the selection of integers was sorted, it could be even faster than the version which requires a binary search for every lookup. You'd start with the first bit pattern, calculate the number of partitions of this type, see if the first integer(s) fall into this range, calculate their partitions, and then move on to the next bit pattern.
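Since the answer suggests translating the method into Python to avoid the 64-bit limit, here is a minimal sketch of that translation. It is not the author's code: the function names (`type_count`, `build_table`, `unrank`) are made up for illustration, the bit patterns are generated with `itertools.combinations` instead of Gosper's Hack, and ranks are 0-based as in the JavaScript and C++ versions.

```python
from bisect import bisect_right
from itertools import combinations

def type_count(bits, n):
    """Number of partitions of the type encoded by an (n-1)-bit pattern."""
    product, factor = 1, 1
    for shift in range(n - 2, -1, -1):       # most significant bit first
        if (bits >> shift) & 1:
            factor += 1                       # step up: a new part is opened
        else:
            product *= factor                 # repeat: any of the parts used so far
    return product

def build_table(n, k):
    """Return (types, running_totals) for all patterns with k-1 ones, ascending."""
    types = sorted(sum(1 << p for p in ones)
                   for ones in combinations(range(n - 1), k - 1))
    totals, total = [], 0
    for bits in types:
        total += type_count(bits, n)
        totals.append(total)
    return types, totals

def unrank(rank, n, types, totals):
    """Partition with the given 0-based rank, as a list of 1-based part numbers."""
    idx = bisect_right(totals, rank)          # first type whose running total exceeds rank
    previous = totals[idx - 1] if idx else 0
    span = totals[idx] - previous
    rank -= previous
    parts, highest = [1], 1
    for shift in range(n - 2, -1, -1):
        if (types[idx] >> shift) & 1:
            highest += 1
            parts.append(highest)
        else:
            span //= highest                  # mixed-radix digit of the variation
            parts.append(rank // span + 1)
            rank %= span
    return parts

types, totals = build_table(8, 5)
print(totals[-1])                             # total number of partitions, S(8,5)
print(unrank(332, 8, types, totals))          # the 333rd partition (1-based) from the example
```

Because Python integers are arbitrary precision, the totals do not overflow; a uniformly random partition would then be `unrank(random.randrange(totals[-1]), n, types, totals)`.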
How about something like this:
import itertools
import random

def random_ksubset(ls, k):
    # we need to know the length of ls, so convert it into a list
    ls = list(ls)
    # sanity check
    if k < 1 or k > len(ls):
        return []
    # Create a list of length ls, where each element is the index of
    # the subset that the corresponding member of ls will be assigned
    # to.
    #
    # We require that this list contains k different values, so we
    # start by adding each possible different value.
    indices = list(range(k))
    # now we add random values from range(k) to indices to fill it up
    # to the length of ls
    indices.extend([random.choice(list(range(k))) for _ in range(len(ls) - k)])
    # shuffle the indices into a random order
    random.shuffle(indices)
    # construct and return the random subset: sort the elements by
    # which subset they will be assigned to, and group them into sets
    return [{x[1] for x in xs} for (_, xs) in
            itertools.groupby(sorted(zip(indices, ls)), lambda x: x[0])]
This produces random k-subset partitions like so:
>>> ls = {1,2,3}
>>> print(random_ksubset(ls, 2))
[set([1, 2]), set([3])]
>>> print(random_ksubset(ls, 2))
[set([1, 3]), set([2])]
>>> print(random_ksubset(ls, 2))
[set([1]), set([2, 3])]
>>> print(random_ksubset(ls, 2))
[set([1]), set([2, 3])]
This method satisfies OP's requirement of getting one randomly-generated partition, without enumerating all possible partitions. Memory complexity here is linear. Run-time complexity is O(N log N) due to the sort. I suppose it might be possible to get this down to linear, if that was important, using a more complicated method of constructing the return value.
As @Leon points out, this satisfies the requirements of his option 2 in trying to define the problem. What this won't do is deterministically generate partition #N (this is Leon's option 1, which would allow you to randomly pick an integer N and then retrieve the corresponding partition).
Leon's clarification is important, because, to satisfy the spirit of the question, every possible partition of the collection should be generated with equal probability. On our toy problem, this is the case:
>>> from collections import Counter
>>> Counter(frozenset(map(frozenset, random_ksubset(ls, 2))) for _ in range(10000))
Counter({frozenset({frozenset({2, 3}), frozenset({1})}): 3392,
         frozenset({frozenset({1, 3}), frozenset({2})}): 3212,
         frozenset({frozenset({1, 2}), frozenset({3})}): 3396})
However, in general this method does not generate each partition with equal probability. Consider:
>>> Counter(frozenset(map(frozenset, random_ksubset(range(4), 2)))
...         for _ in range(10000)).most_common()
[(frozenset({frozenset({1, 3}), frozenset({0, 2})}), 1671),
 (frozenset({frozenset({1, 2}), frozenset({0, 3})}), 1667),
 (frozenset({frozenset({2, 3}), frozenset({0, 1})}), 1642),
 (frozenset({frozenset({0, 2, 3}), frozenset({1})}), 1285),
 (frozenset({frozenset({2}), frozenset({0, 1, 3})}), 1254),
 (frozenset({frozenset({0, 1, 2}), frozenset({3})}), 1245),
 (frozenset({frozenset({1, 2, 3}), frozenset({0})}), 1236)]
We can see here that we are more likely to generate "more balanced" partitions (because there are more ways to construct these). The partitions that contain singleton sets are produced less frequently.
It seems that an efficient uniform sampling method over k-partitions of sets is sort of an unsolved research question (also see mathoverflow). Nijenhuis and Wilf give code for sampling from all partitions (Chapter 12), which could work with rejection testing, and @PeterdeRivaz's answer can also uniformly sample a k-partition. The drawback with both of these methods is that they require computing the Stirling numbers, which grow exponentially in n, and the algorithms are recursive, which I think will make them slow on large inputs. As you mention "millions" of partitions in your comment, I think that these approaches will only be tractable up to a certain input size.
A. Nijenhuis and H. Wilf. Combinatorial Algorithms for Computers and Calculators. Academic Press, Orlando FL, second edition, 1978.
Exploring Leon's option 1 might be interesting. Here's a rough procedure to deterministically produce a particular partition of a collection using @Amadan's suggestion of taking an integer value interpreted as a k-ary number. Note that not every integer value produces a valid k-subset partition (because we disallow empty subsets):
def amadan(ls, N, k):
    """
    Given a collection `ls` with length `b`, a value `k`, and a
    "partition number" `N` with 0 <= `N` < `k**b`, produce the Nth
    k-subset partition of `ls`.
    """
    ls = list(ls)
    b = len(ls)
    if not 0 <= N < k**b:
        return None
    # produce the k-ary index vector from the number N
    index = []
    # iterate through each of the elements of ls
    for _ in range(b):
        index.append(N % k)
        N //= k
    # subsets cannot be empty
    if len(set(index)) != k:
        return None
    return frozenset(frozenset(x[1] for x in xs) for (_, xs) in
                     itertools.groupby(sorted(zip(index, ls)), lambda x: x[0]))
We can confirm that this generates the Stirling numbers properly:
>>> for i in [(4,1), (4,2), (4,3), (4,4), (5,1), (5,2), (5,3), (5,4), (5,5)]:
...     b,k = i
...     r = [amadan(range(b), N, k) for N in range(k**b)]
...     r = [x for x in r if x is not None]
...     print(i, len(set(r)))
(4, 1) 1
(4, 2) 7
(4, 3) 6
(4, 4) 1
(5, 1) 1
(5, 2) 15
(5, 3) 25
(5, 4) 10
(5, 5) 1
This may also be able to produce each possible partition with equal probability; I'm not quite sure.
Here's a test case, where it works:
>>> b,k = 4,3
>>> r = [amadan(range(b), N, k) for N in range(k**b)]
>>> r = [x for x in r if x is not None]
>>> print(Counter([' '.join(sorted(''.join(map(str, x)) for x in p)) for p in r]))
Counter({'0 13 2': 6, '01 2 3': 6, '0 12 3': 6, '03 1 2': 6, '02 1 3': 6, '0 1 23': 6})
Another working case:
>>> b,k = 5,4
>>> r = [amadan(range(b), N, k) for N in range(k**b)]
>>> r = [x for x in r if x is not None]
>>> print(Counter([' '.join(sorted(''.join(map(str, x)) for x in p)) for p in r]))
Counter({'0 12 3 4': 24, '04 1 2 3': 24, '0 1 23 4': 24, '01 2 3 4': 24, '03 1 2 4': 24,
         '0 13 2 4': 24, '0 1 24 3': 24, '02 1 3 4': 24, '0 1 2 34': 24, '0 14 2 3': 24})
So, to wrap this up in a function:
def random_ksubset(ls, k):
    ls = list(ls)
    maxn = k**len(ls)-1
    rv = None
    while rv is None:
        rv = amadan(ls, random.randint(0, maxn), k)
    return rv
And then we can do:
>>> random_ksubset(range(3), 2)
frozenset({frozenset({2}), frozenset({0, 1})})
>>> random_ksubset(range(3), 2)
frozenset({frozenset({1, 2}), frozenset({0})})
>>> random_ksubset(range(3), 2)
frozenset({frozenset({1, 2}), frozenset({0})})
>>> random_ksubset(range(3), 2)
frozenset({frozenset({2}), frozenset({0, 1})})
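On the open question of whether this rejection approach is uniform: a partition into k non-empty subsets can be labelled with the k subset indices in exactly k! ways, so every valid partition should be hit by the same number of k-ary index vectors. The sketch below checks that by brute force for small inputs; `labelings_per_partition` is a hypothetical helper written for this check, not part of the answer's code.

```python
from collections import Counter
from itertools import product

def labelings_per_partition(b, k):
    """Count how many k-ary index vectors of length b map to each k-subset partition."""
    counts = Counter()
    for labels in product(range(k), repeat=b):
        if len(set(labels)) != k:             # some subset would be empty: rejected
            continue
        blocks = {}
        for element, label in enumerate(labels):
            blocks.setdefault(label, []).append(element)
        partition = frozenset(frozenset(block) for block in blocks.values())
        counts[partition] += 1
    return counts

for b, k in [(4, 2), (5, 3)]:
    counts = labelings_per_partition(b, k)
    print((b, k), len(counts), set(counts.values()))
```

If each partition shows the same count, drawing index vectors uniformly and rejecting the invalid ones gives each partition equal probability. The cost is the rejection rate, which can become high when many index vectors leave a subset empty.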
Integer list to ranges
I need to convert a list of ints to a string containing all the ranges in the list. So for example, the output should be as follows:
getIntRangesFromList([1,3,7,2,11,8,9,11,12,15]) -> "1-3,7-9,11-12,15"
So the input is not sorted and there can be duplicate values. The lists range in size from one element to 4k elements. The minimum and maximum values are 1 and 4094.
This is part of a performance critical piece of code. I have been trying to optimize this, but I can't find a way to get this faster. This is my current code:
def _getIntRangesFromList(list):
    if (list==[]):
        return ''
    list.sort()
    ranges = [[list[0],list[0]]] # ranges contains the start and end values of each range found
    for val in list:
        r = ranges[-1]
        if val==r[1]+1:
            r[1] = val
        elif val>r[1]+1:
            ranges.append([val,val])
    return ",".join(["-".join([str(y) for y in x]) if x[0]!=x[1] else str(x[0]) for x in ranges])
Any idea on how to get this faster?
This could be a task for the itertools module.
import itertools

list_num = [1, 2, 3, 7, 8, 9, 11, 12, 15]
groups = (list(x) for _, x in
          itertools.groupby(list_num, lambda x, c=itertools.count(): x - next(c)))
print(', '.join('-'.join(map(str, (item[0], item[-1])[:len(item)])) for item in groups))
This will give you 1-3, 7-9, 11-12, 15. To understand what's going on you might want to check the content of groups.
import itertools

list_num = [1, 2, 3, 7, 8, 9, 11, 12, 15]
groups = (list(x) for _, x in
          itertools.groupby(list_num, lambda x, c=itertools.count(): x - next(c)))
for element in groups:
    print('element={}'.format(element))
This will give you the following output.
element=[1, 2, 3]
element=[7, 8, 9]
element=[11, 12]
element=[15]
The basic idea is to have a counter running parallel to the numbers. groupby will create individual groups for numbers with the same numerical distance to the current value of the counter.
I don't know whether this is faster on your version of Python. You'll have to check this yourself. In my setting it's slower with this data set, but faster with a bigger number of elements.
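The counter trick only works when the input is sorted and free of duplicates, which the question's input is not. A minimal sketch that adds that preprocessing step, assuming a made-up function name `int_ranges` and the comma-only separator from the question:

```python
from itertools import count, groupby

def int_ranges(values):
    """Format integers as ranges, e.g. "1-3,7-9,11-12,15"."""
    unique = sorted(set(values))              # the counter trick needs sorted, distinct input
    counter = count()
    pieces = []
    # Consecutive numbers share the same (value - position) difference.
    for _, run in groupby(unique, key=lambda v: v - next(counter)):
        run = list(run)
        pieces.append(str(run[0]) if len(run) == 1 else f"{run[0]}-{run[-1]}")
    return ",".join(pieces)

print(int_ranges([1, 3, 7, 2, 11, 8, 9, 11, 12, 15]))
```

Whether the extra `set` and `sorted` calls pay off compared with the original loop would need to be measured with timeit on representative data.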
The fastest one I could come up with, which tests about 10% faster than your solution on my machine (according to timeit):
def _ranges(l):
    if l:
        l.sort()
        return ''.join([(str(l[i]) + ('-' if l[i] + 1 == l[i + 1] else ','))
                        for i in range(0, len(l) - 1)
                        if l[i - 1] + 2 != l[i + 1]] + [str(l[-1])])
    else:
        return ''
The above code assumes that the values in the list are unique. If they aren't, it's easy to fix, but there's a subtle hack which will no longer work and the end result will be slightly slower.
I actually timed _ranges(u[:]) because of the sort; u is 600 randomly selected integers from range(1000) comprising 235 subsequences; 83 are singletons and 152 contain at least two numbers. If the list is already sorted, quite a lot of time is saved.
def _to_range(l, start, stop, idx, result):
    if idx == len(l):
        result.append((start, stop))
        return result
    if l[idx] - stop > 1:
        result.append((start, stop))
        return _to_range(l, l[idx], l[idx], idx + 1, result)
    return _to_range(l, start, l[idx], idx + 1, result)

def get_range(l):
    if not l:
        return []
    return _to_range(l, start = l[0], stop = l[0], idx = 0, result = [])

l = [1, 2, 3, 7, 8, 9, 11, 12, 15]
result = get_range(l)
print(result)
>>> [(1, 3), (7, 9), (11, 12), (15, 15)]
# I think it's better to fetch the data as it is and if needed, change it
# with
print(','.join('-'.join([str(start), str(stop)]) for start, stop in result))
>>> 1-3,7-9,11-12,15-15
If you only need the string and don't care about the data itself, you can instead append str(start) + '-' + str(stop) inside _to_range, so there is no need for the extra '-'.join call later.
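One caveat with the recursive version: it uses one stack frame per element, so with the question's lists of up to 4k elements it would exceed CPython's default recursion limit of 1000. The same logic can be written as a loop; this is a sketch under the same assumption of sorted input, with `get_ranges` as a made-up name:

```python
def get_ranges(sorted_values):
    """Return (start, stop) pairs for runs of consecutive integers in a sorted list."""
    result = []
    if not sorted_values:
        return result
    start = stop = sorted_values[0]
    for value in sorted_values[1:]:
        if value - stop > 1:                  # gap found: close the current run
            result.append((start, stop))
            start = value
        stop = value                          # duplicates simply leave stop unchanged
    result.append((start, stop))
    return result

print(get_ranges([1, 2, 3, 7, 8, 9, 11, 12, 15]))
```

The output has the same shape as `get_range` above, so the `','.join(...)` formatting line can be reused unchanged.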
I'll concentrate on performance, which is your main issue, and give two solutions:
1) If the integers are bounded between A and B, and you can create an array of booleans (or even an array of bits, to extend the range you can store) with (B - A + 2) elements, e.g. A = 0 and B = 1 000 000, we can do this (I'll write it in C#, sorry XD). This runs in O(N + (B - A)) and is a good solution if B - A is smaller than the number of elements:
public string getIntRangesFromList(int[] numbers)
{
    //You can change these 2 constants
    const int A = 0;
    const int B = 1000000;
    //Create an array with all its values false by default
    //The last value is always false on purpose: the array stores 1 value more than needed, so the 2nd loop can close the last range
    bool[] apparitions = new bool[B - A + 2];
    int minNumber = B + 1;
    int maxNumber = A - 1;
    int pos;
    for (int i = 0; i < numbers.Length; i++)
    {
        pos = numbers[i] - A;
        apparitions[pos] = true;
        if (minNumber > pos) { minNumber = pos; }
        if (maxNumber < pos) { maxNumber = pos; }
    }
    //I will keep the concatenation simple, but you can make it faster to improve performance
    string result = "";
    bool isInRange = false;
    bool isFirstRange = true;
    int firstPosOfRange = 0; //Its initial value is irrelevant
    for (int i = minNumber; i <= maxNumber + 1; i++)
    {
        if (!isInRange)
        {
            if (apparitions[i])
            {
                if (!isFirstRange) { result += ","; }
                else { isFirstRange = false; }
                result += (i + A);
                isInRange = true;
                firstPosOfRange = i;
            }
        }
        else
        {
            if (!apparitions[i])
            {
                if (i > firstPosOfRange + 1) { result += "-" + (i + A - 1); }
                isInRange = false;
            }
        }
    }
    return result;
}
2) O(N * log N)
public string getIntRangesFromList2(int[] numbers)
{
    string result = "";
    if (numbers.Length > 0)
    {
        //Sorting makes the algorithm complexity O(N * log N).
        //Array.Sort (namespace System) sorts in place; OrderBy would return a new sequence and leave the array unsorted.
        Array.Sort(numbers);
        result += numbers[0];
        int countNumbersInRange = 1;
        for (int i = 1; i < numbers.Length; i++)
        {
            if (numbers[i] == numbers[i - 1]) { continue; } //skip duplicate values
            if (numbers[i] != numbers[i - 1] + 1)
            {
                if (countNumbersInRange > 1) { result += "-" + numbers[i - 1]; }
                result += "," + numbers[i];
                countNumbersInRange = 1;
            }
            else
            {
                countNumbersInRange++;
            }
        }
        //close the last range, which the loop above never terminates
        if (countNumbersInRange > 1) { result += "-" + numbers[numbers.Length - 1]; }
    }
    return result;
}
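For readers following the thread in Python, the first (presence table) idea can be sketched as follows. The bounds 1 and 4094 come from the question; the function name `int_ranges_bounded` and its parameters are assumptions made for this example, and no timing claim is implied.

```python
def int_ranges_bounded(values, low=1, high=4094):
    """Format integers in [low, high] as ranges using a presence table, without sorting."""
    seen = bytearray(high - low + 2)          # one extra zero slot terminates the last run
    for v in values:
        seen[v - low] = 1                     # duplicates just set the same slot again
    pieces = []
    start = None
    for pos, present in enumerate(seen):
        if present and start is None:
            start = pos                       # a run begins here
        elif not present and start is not None:
            first, last = start + low, pos - 1 + low
            pieces.append(str(first) if first == last else f"{first}-{last}")
            start = None
    return ",".join(pieces)

print(int_ranges_bounded([1, 3, 7, 2, 11, 8, 9, 11, 12, 15]))
```

This scans the whole table on every call, so it is most attractive when the lists are large relative to the value range; for short lists the sort-based versions may well be faster.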
How to implement/construct the following permutation, given two n-tuples, efficiently?
I am studying queuing theory, in which I am frequently presented with the following situation. Let x, y both be n-tuples of nonnegative integers (depicting lengths of the n queues). In addition, x and y each have a distinguished queue called their "prime queue". For example,
x = [3, 6, 1, 9, 5, 2] with x' = 1
y = [6, 1, 5, 9, 5, 5] with y' = 5
(In accordance with Python terminology I am counting the queues 0-5.) How can I implement/construct the following permutation f on {0,1,...,5} efficiently?
1. First set f(x') = y'. So here f(1) = 5.
2. Then set f(i) = i for any i such that x[i] == y[i]. Clearly there is no need to consider the indices x' and y'. So here f(3) = 3 (both length 9) and f(4) = 4 (both length 5).
3. There are now equally sized sets of queues unpaired in x and in y. So here in x this is {0,2,5} and in y this is {0,1,2}.
4. Rank these from 1 to s, where s is the common size of the sets, by length with 1 == lowest rank == shortest queue and s == highest rank == longest queue. So here, s = 3, and in x rank(2) = 1, rank(5) = 2 and rank(0) = 3, and in y rank(1) = 1, rank(2) = 2 and rank(0) = 3. If there is a tie, give the queue with the larger index the higher rank.
5. Pair these s queues off by rank. So here f(0) = 0, f(2) = 1, f(5) = 2.
This should give the permutation [0, 5, 1, 3, 4, 2]. My solution consists of tracking the indices and loops over x and y multiple times, and is terribly inefficient. (Roughly looking at n >= 1,000,000 in my application.) Any help would be most appreciated.
Since you must do the ranking, a comparison-based solution can't be linear and will need to sort. Otherwise it looks pretty straightforward. You do 1. in O(1) and 2. in O(n) by just going over the n-tuples. At the same time, you can construct copies of x and y containing only the queues that are left for 3., storing not just the value but a tuple of the value and its index in the original. In your example, x-with-tuples-left would be [[3,0],[1,2],[2,5]] and y-with-tuples-left would be [[6,0],[1,1],[5,2]].
Then just sort both x-with-tuples-left and y-with-tuples-left (this is O(n log n)), and read the permutation from the second element of the corresponding tuples. In your example, sorted x-with-... would be [[1,2],[2,5],[3,0]] and sorted y-with-... would be [[1,1],[5,2],[6,0]]. Now you can read off 5. from the second elements: f(2)=1, f(5)=2, f(0)=0.
EDIT: Here is an O(n+L) version in Javascript, which ranks with a counting sort instead (L being the longest queue length):
function qperm (x, y, xprime, yprime) {
    var i;
    var n = x.length;
    var qperm = new Array(n);
    var countsx = [], countsy = []; // same as new Array()
    qperm[xprime] = yprime; // doing 1.
    for (i = 0; i < n; ++i) {
        if (x[i] == y[i] && i != xprime && i != yprime) { // doing 2.
            qperm[i] = i;
        } else { // preparing for 4. below
            if (i != xprime) {
                if (countsx[x[i]]) countsx[x[i]]++; else countsx[x[i]] = 1;
            }
            if (i != yprime) {
                if (countsy[y[i]]) countsy[y[i]]++; else countsy[y[i]] = 1;
            }
        }
    }
    // finishing countsx and countsy: turn the counts into starting ranks
    var count, sum;
    for (i = 0, count = 0; i < countsx.length; ++i) {
        if (countsx[i]) { sum = count + countsx[i]; countsx[i] = count; count = sum; }
    }
    for (i = 0, count = 0; i < countsy.length; ++i) {
        if (countsy[i]) { sum = count + countsy[i]; countsy[i] = count; count = sum; }
    }
    var yranked = new Array(count);
    for (i = 0; i < n; ++i) {
        if (i != yprime && (x[i] != y[i] || i == xprime)) { // doing 4. for y
            yranked[countsy[y[i]]] = i; // store the index of the queue, not its length
            countsy[y[i]]++;
        }
    }
    for (i = 0; i < n; ++i) {
        if (i != xprime && (x[i] != y[i] || i == yprime)) { // doing 4. for x and 5. at the same time
            qperm[i] = yranked[countsx[x[i]]];
            countsx[x[i]]++;
        }
    }
    return qperm;
}
Hopefully it's correct ;-)
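Since the question is framed in Python terms, the sort-based approach described at the start of this answer can be sketched as below. The function name `queue_permutation` and its argument names are made up for this example; it follows steps 1 to 5 of the question and relies on tuple comparison to break ties by index.

```python
def queue_permutation(x, y, x_prime, y_prime):
    """Return f as a list, where f[i] is the queue of y paired with queue i of x."""
    n = len(x)
    f = [None] * n
    f[x_prime] = y_prime                      # step 1
    left_x, left_y = [], []
    for i in range(n):
        if x[i] == y[i] and i != x_prime and i != y_prime:
            f[i] = i                          # step 2
        else:                                 # step 3: collect (length, index) pairs
            if i != x_prime:
                left_x.append((x[i], i))
            if i != y_prime:
                left_y.append((y[i], i))
    # Steps 4 and 5: tuples sort by length first, then by index, so among
    # equal lengths the larger index gets the higher rank.
    for (_, i), (_, j) in zip(sorted(left_x), sorted(left_y)):
        f[i] = j
    return f

print(queue_permutation([3, 6, 1, 9, 5, 2], [6, 1, 5, 9, 5, 5], 1, 5))
```

This is the O(n log n) variant; for n around a million the two sorts are likely to dominate the running time, which is the cost the counting-sort version above tries to avoid.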