Finding a subset whose sum is zero - python

I have an Excel file containing two rows: one holds numbers, the other holds an ID (basically a serial number) for each number.
The numbers are both positive and negative. I have to find out whether there exists a subset whose sum is zero and, if so, the IDs of the elements in that subset. It would be awesome if I could also find all such subsets smaller than a given size, say 3 or 4.
The most important part is that I want the IDs of the numbers.
def subset_sum_exists(numbers, target):
    n = len(numbers)
    # Initialize a 2D array with size (n+1) * (target+1)
    dp = [[False for _ in range(target + 1)] for _ in range(n + 1)]
    # Initialize the first column with true
    for i in range(n + 1):
        dp[i][0] = True
    for i in range(1, n + 1):
        for j in range(1, target + 1):
            # if j is less than current number
            if j < numbers[i - 1]:
                dp[i][j] = dp[i - 1][j]
            # if j is greater or equal than current number
            if j >= numbers[i - 1]:
                dp[i][j] = dp[i - 1][j] or dp[i - 1][j - numbers[i - 1]]
    # return the last element
    return dp[n][target]
This code should at least tell me whether such a subset exists, but there seems to be an error in it. For target = 0 it always returns True; for other targets it raises an error.
numbers = [6, 20, 54, 93, -54, -26]
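The table in the code above only has columns 0..target, so it cannot represent negative partial sums, and dp[i][0] is True for every i because the empty subset always sums to zero. A minimal sketch of a different approach, assuming the list is small enough to enumerate subsets up to a maximum size: the brute-force search below pairs each number with its ID and returns the IDs of every non-empty subset that sums to zero. The function name, the `max_size` parameter and the 1-based IDs are assumptions made for illustration.

```python
from itertools import combinations

def zero_sum_subsets(ids, numbers, max_size):
    """Return the IDs of every non-empty subset of at most max_size elements summing to zero."""
    items = list(zip(ids, numbers))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            if sum(value for _, value in combo) == 0:
                found.append([item_id for item_id, _ in combo])
    return found

numbers = [6, 20, 54, 93, -54, -26]
ids = list(range(1, len(numbers) + 1))  # hypothetical IDs 1..6
print(zero_sum_subsets(ids, numbers, 3))  # [[3, 5], [1, 2, 6]]
```

The number of candidate subsets grows quickly with `max_size`, so this is only reasonable for the small sizes (3 or 4) mentioned in the question.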

Related

Is it possible to process string from starting for DP solution

I was trying out the longest palindromic subsequence problem from LeetCode.
One of the discussed solutions is as follows:
class Solution:
    def longestPalindromeSubseq(self, s: str) -> int:
        n = len(s)
        dp = [[0] * n for _ in range(n)]
        for i in range(n - 1, -1, -1):
            dp[i][i] = 1
            for j in range(i+1, n):
                if s[i] == s[j]:
                    dp[i][j] = dp[i + 1][j - 1] + 2
                else:
                    dp[i][j] = max(dp[i + 1][j], dp[i][j - 1])
        return dp[0][n - 1]
So it starts from the end of the string.
I was wondering whether it is possible to begin from the start of the string instead, that is, to have loops something like this:
for i in range(0, n):
    for j in range(i+1, n):
        # ...
But dp[i + 1] won't have been calculated yet for any given iteration of i, and we need dp[i + 1] to evaluate
dp[i][j] = dp[i + 1][j - 1] + 2 and
dp[i][j] = max(dp[i + 1][j], dp[i][j - 1])
Is it possible to change these two updates to dp (and hence come up with a new recurrence relation) so that we can begin from the start of the string, or is starting from the end the only way? (I was not able to come up with any recurrence or index adjustment that makes it possible, so I have started to believe that it is indeed not possible, but I wanted to be sure.)
The first hint that you can do this from the beginning is symmetry: if this logic gets the right answer for a string such as 'baabbcc', the same logic also works for the reversed string ('ccbbaab').
The more robust reasoning comes from what dp[i][j] represents: the length of the longest palindromic subsequence between i and j inclusive. We calculate this dp array using two pointers, i and j.
We iterate over all possible values of i and j. If s[i] == s[j], then the answer for i to j equals the answer for i+1 to j-1 plus 2, because we can take the answer for i+1 to j-1 and add s[i] and s[j] to its beginning and end. I hope this is clear from the code you provided.
What that means is that to calculate dp[i][j], you need dp[i+1][j-1] (and, when the characters differ, dp[i+1][j] and dp[i][j-1]).
The code you have provided does this by starting the i pointer from the end and, for every i, looping j from i+1 to n-1. This means that i+1 is reached before i and j-1 is reached before j.
However, you can achieve the same effect starting from the beginning. This time, start by moving the j pointer from the beginning and, for every j, move the i pointer backward from i = j-1 to i = 0. This ensures that j-1 is reached before j and i+1 is reached before i, which is what we're looking for.
The final code would look something like this (which I've submitted and had accepted):
class Solution:
    def longestPalindromeSubseq(self, s: str) -> int:
        n = len(s)
        dp = [[0] * n for _ in range(n)]
        for j in range(0, n):
            dp[j][j] = 1
            for i in range(j-1, -1, -1):
                if s[i] == s[j]:
                    dp[i][j] = dp[i + 1][j - 1] + 2
                else:
                    dp[i][j] = max(dp[i + 1][j], dp[i][j - 1])
        return dp[0][n - 1]
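Both loop orders work because every cell only depends on cells that describe shorter substrings. A minimal sketch of a third order that makes this explicit by filling the table in increasing substring length; the function name `lps_by_length` is an assumption for illustration and is not part of the submitted solution.

```python
def lps_by_length(s: str) -> int:
    """Longest palindromic subsequence, filling dp by increasing substring length."""
    n = len(s)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for i in range(n):
        dp[i][i] = 1  # every single character is a palindrome
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            if s[i] == s[j]:
                # For length 2, dp[i + 1][j - 1] lies below the diagonal and is still 0.
                dp[i][j] = dp[i + 1][j - 1] + 2
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j - 1])
    return dp[0][n - 1]

print(lps_by_length("baabbcc"), lps_by_length("ccbbaab"))  # 4 4
```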

How to tell if a number can be written as the sum of n different squares?

I know how to check if the number can be represented as the sum of two squares with a brute-force approach.
def sumSquare( n) :
    i = 1
    while i * i <= n :
        j = 1
        while(j * j <= n) :
            if (i * i + j * j == n) :
                print(i, "^2 + ", j , "^2" )
                return True
            j = j + 1
        i = i + 1
    return False
But how do I do it for n distinct positive integers? So the question would be:
Write a function which checks whether a number can be written as the sum of 'n' different squares.
I have some examples:
is_sum_of_squares(18, 2) would be false, because 18 can be written as the sum of two squares (3^2 + 3^2) but they are not distinct.
(38,3) would be true, because 5^2+3^2+2^2 = 38 and 5, 3 and 2 are all different.
I can't extend the if condition for more values. I think it could be done with recursion, but I have problems with it.
I found this function very useful since it finds the minimum number of squares the number can be split into.
def findMinSquares(n):
    T = [0] * (n + 1)
    for i in range(n + 1):
        T[i] = i
        j = 1
        while j * j <= i:
            T[i] = min(T[i], 1 + T[i - j * j])
            j += 1
    return T[n]
But again I can't do it with recursion. Sadly I can't wrap my head around it. We started learning it a few weeks ago (I am in high school) and it is so different from the iterative approach.
Recursive approach:
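from math import isqrt
def is_sum_of_squares(x, n, used=None):
    # Returns a set of n distinct positive integers whose squares sum to x, or None.
    used = used or set()
    smallest_allowed = max(used, default=0) + 1
    if n == 1:
        root = isqrt(x)
        # The last number must be an exact root and larger than every number already used.
        if root * root == x and root >= smallest_allowed:
            return used.union([root])
        return None
    # Numbers are picked in increasing order, so the next one satisfies n * i**2 <= x.
    for i in range(smallest_allowed, isqrt(x // n) + 1):
        squares = is_sum_of_squares(x - i**2, n - 1, used.union([i]))
        if squares:
            return squares
    return None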
Quite a compelling exercise. I have attempted solving it using recursion in a form of backtracking. Start with an empty list, run a for loop to add numbers to it from 1 to max feasible (square root of target number) and for each added number continue with recursion. Once the list reaches the required size n, validate the result. If the result is incorrect, backtrack by removing the last number.
Not sure if it is 100% correct though. In terms of speed, I tried it on the (1000,13) input and the process finished reasonably fast (3-4s).
def is_sum_of_squares(num, count):
    max_num = int(num ** 0.5)
    return backtrack([], num, max_num, count)
def backtrack(candidates, target, max_num, count):
    """
    candidates = list of ints of max length <count>
    target = sum of squares of <count> nonidentical numbers
    max_num = square root of target, rounded
    count = desired size of candidates list
    """
    result_num = sum([x * x for x in candidates])  # calculate sum of squares
    if result_num > target:  # if sum exceeded target number stop recursion
        return False
    if len(candidates) == count:  # if candidates reach desired length, check if result is valid and return result
        result = result_num == target
        if result:  # print for result sense check, can be removed
            print("Found: ", candidates)
        return result
    for i in range(1, max_num + 1):  # cycle from 1 to max feasible number
        if candidates and i <= candidates[-1]:
            # for non empty list, skip numbers smaller than the last number.
            # allow only ascending order to eliminate duplicates
            continue
        candidates.append(i)  # add number to list
        if backtrack(candidates, target, max_num, count):  # next recursion
            return True
        candidates.pop()  # if combination was not valid then backtrack and remove the last number
    return False
assert(is_sum_of_squares(38, 3))
assert(is_sum_of_squares(30, 3))
assert(is_sum_of_squares(30, 4))
assert(is_sum_of_squares(36, 1))
assert not(is_sum_of_squares(35, 1))
assert not(is_sum_of_squares(18, 2))
assert not(is_sum_of_squares(1000, 13))

Maximum sum path in the matrix with a given starting point

I am learning to tackle a similar type of dynamic programming problem to find a maximum path sum in a matrix.
I have based my learning on this algorithm on the website below.
Source: Maximum path sum in matrix
The problem I am trying to solve is a little bit different from the one on the website.
The algorithm from the website makes use of max() to update values in the matrix to find max values to create a max path.
For example, given an array:
sample = [[110, 111, 108, 1],
[9, 8, 7, 2],
[4, 5, 10, 300],
[1, 2, 3, 4]]
The best sum path is 111 + 7 + 300 + 4 = 422
In the example above, the algorithm picks the start of the path by finding the maximum value in the first row of the matrix.
My question is: what if I have to specify the starting point of the algorithm? The value h is given as the column of the first element to start from.
For example, given the sample array above, if h = 0, we need to start at sample[0][h], therefore the best path would be
110 (our starting point) + 8 + 10 + 4 = 132
As you can see, the path can only travel straight down or diagonally down, so if we start at h = 0 there are values, such as 300, that we cannot reach.
Here is my messy attempt at solving this within O(N*D) complexity:
# Find max path given h as a starting point
def find_max_path_w_start(mat, h):
    res = mat[0][0]
    M = len(mat[0])
    N = len((mat))
    for i in range(1, N):
        res = 0
        for j in range(M):
            # Compute the ajacent sum of the ajacent values from h
            if i == 1:
                # If h is starting area, then compute the sum, find the max
                if j == h:
                    # All possible
                    if (h > 0 and h < M - 1):
                        mat[1][h + 1] += mat[0][h]
                        mat[1][h] += mat[0][h]
                        mat[1][h - 1] += mat[0][h]
                        print(mat)
                    # Diagona Right not possible
                    elif (h > 0):
                        mat[1][h] += mat[0][h]
                        mat[1][h - 1] += mat[0][h]
                    # Diagonal left not possible
                    elif (h < M - 1):
                        mat[1][h] += mat[0][h]
                        mat[1][h + 1] += mat[0][h]
                # Ignore value that has been filled.
                elif j == h + 1 or j == h - 1 :
                    pass
                # Other elements that cannot reach, make it -1
                elif j > h + 1 or j < h - 1:
                    mat[i][j] = -1
            else:
                # Other elements that cannot reach, make it -1
                if j > h + 1 or j < h - 1:
                    mat[i][j] = -1
                else:
                    # When all paths are possible
                    if (j > 0 and j < M - 1):
                        mat[i][j] += max(mat[i - 1][j],
                                         max(mat[i - 1][j - 1],
                                             mat[i - 1][j + 1]))
                    # When diagonal right is not possible
                    elif (j > 0):
                        mat[i][j] += max(mat[i - 1][j],
                                         mat[i - 1][j - 1])
                    # When diagonal left is not possible
                    elif (j < M - 1):
                        mat[i][j] += max(mat[i - 1][j],
                                         mat[i - 1][j + 1])
            res = max(mat[i][j], res)
    return res
My approach is to store only the reachable values. For example, if we start at h = 0, i.e. at mat[0][h], we can only compute the sum of the current value and the one directly below it (mat[1][h]) and the sum of the current value and the one diagonally to the right (mat[1][h + 1]); I mark all other values as -1 to flag them as unreachable.
This doesn't return the expected sum at the end.
Is my logic incorrect? Are there other values that I need to store to complete this?
You can set all elements of the first row except h to negative infinity, and compute the answer as if there is no starting point restriction.
For example, put this piece of code at the start of your code
for i in range(M):
    if i != h:
        mat[0][i] = -1e100
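A minimal self-contained sketch of that idea, assuming moves go straight down or diagonally down as in the question. The function name `max_path_from` is hypothetical, `float("-inf")` stands in for the large negative constant, and the rows are processed into new lists so that the caller's matrix is left untouched.

```python
def max_path_from(mat, h):
    """Maximum path sum that starts at mat[0][h] and moves down or diagonally down."""
    neg_inf = float("-inf")
    cols = len(mat[0])
    # Only column h is a legal start; every other first-row cell is unreachable.
    prev = [mat[0][j] if j == h else neg_inf for j in range(cols)]
    for row in mat[1:]:
        curr = []
        for j in range(cols):
            # Best predecessor among up-left, up and up-right (clipped to the matrix).
            best_above = max(prev[max(j - 1, 0):min(j + 1, cols - 1) + 1])
            curr.append(row[j] + best_above)  # stays -inf for unreachable cells
        prev = curr
    return max(prev)

sample = [[110, 111, 108, 1],
          [9, 8, 7, 2],
          [4, 5, 10, 300],
          [1, 2, 3, 4]]
print(max_path_from(sample, 0))  # 132
print(max_path_from(sample, 1))  # 422
```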
Here is a solution which works in a similar way to yours; however, it only calculates path sums for matrix values that could have been reached from h.
def find_max_path_w_start(mat, h):
    M = len(mat[0])
    N = len((mat))
    for i in range(1, N):
        # `h - i` is the left hand side of a triangle with `h` as the top point.
        # `max(..., 0)` ensures that it is at least 0 and in the matrix.
        min_j = max(h - i, 0)
        # similar to above, but the right hand side of the triangle.
        max_j = min(h + i, M - 1)
        for j in range(min_j, max_j + 1):
            # min_k and max_k are the start and end indices of the points in the above
            # layer which could potentially lead to a correct solution.
            # Generally, you want to iterate from `j - 1` up to `j + 1`,
            # however if at the edge of the triangle, do not take points from outside the triangle:
            # this leads to the `h - i + 1` and `h + i - 1`.
            # The `0` and `M - 1` prevent values outside the matrix being sampled.
            min_k = max(j - 1, h - i + 1, 0)
            max_k = min(j + 1, h + i - 1, M - 1)
            # Find the max of the possible path totals
            mat[i][j] += max(mat[i - 1][k] for k in range(min_k, max_k + 1))
    # Only sample from items in the bottom row which could be paths from `h`.
    # The bottom row is row `N - 1`, so the triangle spans `h - (N - 1)` to `h + (N - 1)`.
    return max(mat[-1][max(h - (N - 1), 0):min(h + (N - 1), M - 1) + 1])
sample = [[110, 111, 108, 1],
          [9, 8, 7, 2],
          [4, 5, 10, 300],
          [1, 2, 3, 4]]
print(find_max_path_w_start(sample, 0))
It's easy to build a bottom-up solution here. Start by thinking about the case where there are only one or two rows, then extend it; that makes the algorithm easy to understand.
Note: this modifies the original matrix instead of creating a new one. If you need to run the function multiple times on the same matrix, you'll need to pass in a copy of the matrix each time.
def find_max_path_w_start(mat, h):
    M = len(mat[0])
    N = len((mat))
    # build solution bottom up
    for i in range(N-2,-1,-1):
        for j in range(M):
            # the cell directly below is always reachable
            possible_values = [mat[i+1][j]]
            # the diagonal neighbours only exist away from the matrix edges
            if j > 0:
                possible_values.append(mat[i+1][j-1])
            if j < M-1:
                possible_values.append(mat[i+1][j+1])
            mat[i][j] += max(possible_values)
    return mat[0][h]
sample = [[110, 111, 108, 1],
          [9, 8, 7, 2],
          [4, 5, 10, 300],
          [1, 2, 3, 4]]
print(find_max_path_w_start(sample, 0)) # prints 132

Why have the smaller array drive binary search when computing the median of two sorted arrays?

A common algorithm for solving the problem of finding the median of two sorted arrays of size m and n is to:
Run binary search to adjust "a cut" of the smaller array in two halves. When doing so, we adjust the cut of the larger array to make sure the total number of elements on the first halves of both arrays equals the total number of elements in the second halves of both arrays, which is a pre-condition for splitting both arrays around the median.
The binary search shifts the cuts left or right until all elements on the left halves <= all elements on the right halves.
At the end of the procedure, we can readily compute the median with a basic comparison of the elements on the boundary of the cuts of both arrays.
While I understand the algorithm at a high level, I'm not sure I understand why one needs to run the binary search on the smaller array and adjust the larger array, as opposed to the other way around.
Here's a video explaining the algorithm, but the author doesn't explain exactly why we use the smaller array to drive the binary search.
I'm also including below Python code that is supposed to solve the problem, mostly to make the post self-contained, even if it's not well documented.
def median(A, B):
    m, n = len(A), len(B)
    if m > n:
        ## Making sure that A refers to the smaller array
        A, B, m, n = B, A, n, m
    if n == 0:
        raise ValueError
    # `//` keeps the indices integral on both Python 2 and Python 3
    imin, imax, half_len = 0, m, (m + n + 1) // 2
    while imin <= imax:
        i = (imin + imax) // 2
        j = half_len - i
        if i < m and B[j-1] > A[i]:
            # i is too small, must increase it
            imin = i + 1
        elif i > 0 and A[i-1] > B[j]:
            # i is too big, must decrease it
            imax = i - 1
        else:
            # i is perfect
            if i == 0: max_of_left = B[j-1]
            elif j == 0: max_of_left = A[i-1]
            else: max_of_left = max(A[i-1], B[j-1])
            if (m + n) % 2 == 1:
                return max_of_left
            if i == m: min_of_right = B[j]
            elif j == n: min_of_right = A[i]
            else: min_of_right = min(A[i], B[j])
            return (max_of_left + min_of_right) / 2.0
By enforcing m <= n, we make sure both i and j are always non-negative.
Also, we are able to reduce some redundant boundary checks in the while loop when working with i and j.
Take the first if condition in the while loop as an example: the code checks i < m before accessing A[i], so why doesn't it also check j-1 >= 0 before accessing B[j-1]? This is because i falls into [0, m] and j = (m + n + 1) / 2 - i, so when i is the largest, j is the smallest.
When i < m, j = (m + n + 1)/2 - i > (m + n + 1)/2 - m = n/2 - m/2 + 1/2 >= 0 (using m <= n). So j must be positive when i < m, and j - 1 >= 0.
Similarly, in the second if condition in the while loop, when i > 0, j is guaranteed to be less than n.
To verify this idea, you can try removing the size check and swap logic at the top and running the code on the example input below, in which A is longer than B.
[1,2,3,4,6]
[5]
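That experiment can be sketched as follows. This is the question's routine rewritten with integer division so it runs on Python 3, plus a hypothetical `swap` flag (not in the original code) that switches the "smaller array first" step off. On this particular input the unswapped run ends up with j = -1 and fails with an IndexError; on other inputs a negative index may instead wrap around silently in Python and produce a wrong answer.

```python
def median(A, B, swap=True):
    """Median of two sorted lists; swap=False skips the 'smaller array first' step."""
    m, n = len(A), len(B)
    if swap and m > n:
        A, B, m, n = B, A, n, m
    if n == 0:
        raise ValueError("both arrays are empty")
    imin, imax, half_len = 0, m, (m + n + 1) // 2
    while imin <= imax:
        i = (imin + imax) // 2
        j = half_len - i
        if i < m and B[j - 1] > A[i]:
            imin = i + 1  # i is too small
        elif i > 0 and A[i - 1] > B[j]:
            imax = i - 1  # i is too big
        else:
            if i == 0:
                max_of_left = B[j - 1]
            elif j == 0:
                max_of_left = A[i - 1]
            else:
                max_of_left = max(A[i - 1], B[j - 1])
            if (m + n) % 2 == 1:
                return max_of_left
            if i == m:
                min_of_right = B[j]
            elif j == n:
                min_of_right = A[i]
            else:
                min_of_right = min(A[i], B[j])
            return (max_of_left + min_of_right) / 2.0

print(median([1, 2, 3, 4, 6], [5]))  # 3.5
try:
    median([1, 2, 3, 4, 6], [5], swap=False)
except IndexError as exc:
    print("without the swap:", exc)
```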

Find subset with K elements that are closest to each other

Given an array of integers size N, how can you efficiently find a subset of size K with elements that are closest to each other?
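Let the closeness for a subset (x1,x2,x3,..xk) be defined as the sum of the absolute differences over all pairs of its elements:
$$\sum_{1 \le i < j \le K} |x_i - x_j|$$
Constraints:
2 <= N <= 10^5
2 <= K <= N
The array may contain duplicates and is not guaranteed to be sorted.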
My brute force solution is very slow for large N, and it doesn't check if there's more than 1 solution:
import sys
N = input()
K = input()
assert 2 <= N <= 10**5
assert 2 <= K <= N
a = []
for i in xrange(0, N):
    a.append(input())
a.sort()
minimum = sys.maxint
startindex = 0
for i in xrange(0,N-K+1):
    last = i + K
    tmp = 0
    for j in xrange(i, last):
        for l in xrange(j+1, last):
            tmp += abs(a[j]-a[l])
            if(tmp > minimum):
                break
    if(tmp < minimum):
        minimum = tmp
        startindex = i #end index = startindex + K?
Examples:
N = 7
K = 3
array = [10,100,300,200,1000,20,30]
result = [10,20,30]
N = 10
K = 4
array = [1,2,3,4,10,20,30,40,100,200]
result = [1,2,3,4]
Your current solution is O(NK^2) (assuming K > log N). With some analysis, I believe you can reduce this to O(NK).
The closest set of size K will consist of elements that are adjacent in the sorted list. You essentially have to first sort the array, so the subsequent analysis will assume that each sequence of K numbers is sorted, which allows the double sum to be simplified.
Assuming that the array is sorted such that x[j] >= x[i] when j > i, we can rewrite your closeness metric to eliminate the absolute value:
$$\sum_{1 \le i < j \le K} (x_j - x_i)$$
Next we rewrite your notation into a double summation with simple bounds:
$$\sum_{i=1}^{K-1} \sum_{j=i+1}^{K} (x_j - x_i)$$
Notice that we can rewrite the inner distance between x[i] and x[j] as a third summation:
$$\sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \sum_{l=i}^{j-1} d_l$$
where I've used d[l] to simplify the notation going forward:
$$d_l = x_{l+1} - x_l$$
Notice that d[l] is the distance between each adjacent element in the list. Look at the structure of the inner two summations for a fixed i:
j=i+1          d[i]
j=i+2          d[i] + d[i+1]
j=i+3          d[i] + d[i+1] + d[i+2]
...
j=K=i+(K-i)    d[i] + d[i+1] + d[i+2] + ... + d[K-1]
Notice the triangular structure of the inner two summations. This allows us to rewrite the inner two summations as a single summation in terms of the distances of adjacent terms:
total: (K-i)*d[i] + (K-i-1)*d[i+1] + ... + 2*d[K-2] + 1*d[K-1]
which reduces the total sum to:
$$\sum_{i=1}^{K-1} \sum_{l=i}^{K-1} (K-l)\, d_l$$
Now we can look at the structure of this double summation:
i=1     (K-1)*d[1] + (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
i=2                  (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
i=3                               (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
...
i=K-2                                                2*d[K-2] + d[K-1]
i=K-1                                                           d[K-1]
Again, notice the triangular pattern. The total sum then becomes:
1*(K-1)*d[1] + 2*(K-2)*d[2] + 3*(K-3)*d[3] + ... + (K-2)*2*d[K-2] + (K-1)*1*d[K-1]
Or, written as a single summation:
$$\sum_{l=1}^{K-1} l\,(K-l)\, d_l$$
This compact single summation of adjacent differences is the basis for a more efficient algorithm:
Sort the array, order O(N log N)
Compute the differences of each adjacent element, order O(N)
Iterate over each of the N-K+1 windows of K-1 adjacent differences and calculate the above sum, order O(NK)
Note that the second and third step could be combined, although with Python your mileage may vary.
The code:
def closeness(diff,K):
    # weighted sum of the K-1 adjacent differences of one window
    acc = 0.0
    for (i,v) in enumerate(diff):
        acc += (i+1)*(K-(i+1))*v
    return acc
def closest(a,K):
    a.sort()
    N = len(a)
    # `range` works on both Python 2 and Python 3
    diff = [ a[i+1] - a[i] for i in range(N-1) ]
    min_ind = 0
    min_val = closeness(diff[0:K-1],K)
    for ind in range(1,N-K+1):
        cl = closeness(diff[ind:ind+K-1],K)
        if cl < min_val:
            min_ind = ind
            min_val = cl
    return a[min_ind:min_ind+K]
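As a sanity check on the derivation, the weighted sum of adjacent differences (weight l*(K-l) on the l-th gap, as in the expanded total above) can be compared with the brute-force pairwise sum. The two helper names below are assumptions for illustration only.

```python
from itertools import combinations

def pairwise_closeness(window):
    """Brute-force sum of |x_i - x_j| over all pairs in the window."""
    return sum(abs(p - q) for p, q in combinations(window, 2))

def weighted_closeness(window):
    """Same quantity as a sum of l * (K - l) * d[l] over the sorted window."""
    xs = sorted(window)
    K = len(xs)
    # xs[l] - xs[l - 1] is the l-th adjacent gap, counting gaps from 1.
    return sum(l * (K - l) * (xs[l] - xs[l - 1]) for l in range(1, K))

for window in ([10, 20, 30], [1, 2, 3, 4], [5, 5, 9, 2, 14]):
    assert pairwise_closeness(window) == weighted_closeness(window)
print(weighted_closeness([10, 20, 30]), weighted_closeness([1, 2, 3, 4]))  # 40 10
```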
itertools to the rescue?
from itertools import combinations
def closest_elements(iterable, K):
    N = set(iterable)
    assert(2 <= K <= len(N) <= 10**5)
    combs = lambda it, k: combinations(it, k)
    _abs = lambda it: abs(it[0] - it[1])
    d = {}
    v = 0
    for x in combs(N, K):
        for y in combs(x, 2):
            v += _abs(y)
        d[x] = v
        v = 0
    return min(d, key=d.get)
>>> a = [10,100,300,200,1000,20,30]
>>> b = [1,2,3,4,10,20,30,40,100,200]
>>> print closest_elements(a, 3); closest_elements(b, 4)
(10, 20, 30) (1, 2, 3, 4)
This procedure can be done in O(N*K) if A is sorted. If A is not sorted, sorting adds O(N log N) on top of that.
This is based on 2 facts (relevant only when A is ordered):
The closest subsets will always consist of consecutive elements
When calculating the closeness of K consecutive elements, the sum of pairwise distances can be calculated as the sum of the distances between each two adjacent elements times (K-i)*i, where i is 1,...,K-1.
When iterating through the sorted array, only the pairs involving the element that leaves the window and the element that enters it change from one window to the next. The pseudo-code below does not exploit this; it recomputes each window's weighted sum directly, which gives the O(N*K) bound.
Here's the pseudo-code
List<pair> FindClosestSubsets(int[] A, int K)
{
    List<pair> minList = new List<pair>();
    int minVal = infinity;
    int tempSum;
    int N = A.length;
    for (int i = K - 1; i < N; i++)
    {
        // The window covers A[i-K+1] .. A[i].
        tempSum = 0;
        for (int j = i - K + 2; j <= i; j++)
        {
            int w = j - (i - K + 1); // position of the gap inside the window: 1 .. K-1
            tempSum += (K - w) * w * (A[j] - A[j-1]);
        }
        if (tempSum < minVal)
        {
            minVal = tempSum;
            minList.clear();
            minList.add(new pair(i-K+1, i));
        }
        else if (tempSum == minVal)
            minList.add(new pair(i-K+1, i));
    }
    return minList;
}
This function returns a list of index pairs representing the optimal solutions (the starting and ending index of each solution), since the question implied that you want all solutions with the minimal value.
Try the following:
N = int(input())
K = int(input())
assert 2 <= N <= 10**5
assert 2 <= K <= N
a = some_unsorted_list
a.sort()
cur_diff = sum([abs(a[i] - a[i + 1]) for i in range(K - 1)])
min_diff = cur_diff
min_last_idx = K - 1
for last_idx in range(K,N):
    # drop the first gap of the old window, add the last gap of the new one
    cur_diff = cur_diff - \
               abs(a[last_idx - K] - a[last_idx - K + 1]) + \
               abs(a[last_idx] - a[last_idx - 1])
    if min_diff > cur_diff:
        min_diff = cur_diff
        min_last_idx = last_idx
From min_last_idx you can calculate min_first_idx. I use range to preserve the order of the indices; on Python 2.7 it builds the whole list and therefore takes linearly more RAM. This avoids re-summing every window, so it has a smaller constant than your approach. Note that it minimizes the sum of adjacent differences within the window (that is, the window's range), which is not the same quantity as the sum over all pairs.
After sorting, we can be sure that, if x1, x2, ... xk are the solution, then x1, x2, ... xk are contiguous elements, right?
So,
take the intervals between numbers
sum these intervals to get the intervals between k numbers
Choose the smallest of them
My initial solution was to look through every K-element window, multiply each element by m and take the sum over that window, where m is initialized to -(K-1) and incremented by 2 at each step, and then take the minimum sum over the entire list. So for a window of size 3, m starts at -2 and the multipliers for the window are -2, 0, 2. This is because I observed that each element in the K window contributes a certain weight to the sum. For example, if the elements are [10 20 30], the sum is (30-10) + (30-20) + (20-10); if we break down the expression we have 2*30 + 0*20 + (-2)*10. Each window can be evaluated in O(K) time, so the entire operation takes O(NK) time. However, it turns out that this solution is not optimal, and there are certain edge cases where this algorithm fails. I am yet to figure out those cases, but I am sharing the solution anyway in case anyone can get something useful out of it.
for(i = 0 ;i <= n - k;++i)
{
    diff = 0;
    l = -(k-1);
    for(j = i;j < i + k;++j)
    {
        diff += a[j]*l;
        if(min < diff)
            break;
        l += 2;
    }
    if(j == i + k && diff > 0)
        min = diff;
}
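The weight observation itself can be checked independently of the loop above: in a sorted window, the element at 0-based position t is the larger member of t pairs and the smaller member of K-1-t pairs, so its coefficient in the pairwise sum is 2t-(K-1). A minimal Python sketch under that assumption, using prefix sums so each window costs O(1) after sorting; the function name `closest_by_weights` and the returned (window, cost) pair are illustrative choices, and this is a variation on the idea rather than a fix of the snippet above.

```python
from itertools import accumulate

def closest_by_weights(values, k):
    """Return (window, cost) for the k sorted-adjacent elements with the smallest pairwise sum."""
    a = sorted(values)
    n = len(a)
    # prefix[i] = a[0] + ... + a[i-1]; wprefix[i] = 0*a[0] + 1*a[1] + ... + (i-1)*a[i-1]
    prefix = [0] + list(accumulate(a))
    wprefix = [0] + list(accumulate(i * x for i, x in enumerate(a)))
    best_start, best_cost = 0, None
    for s in range(n - k + 1):
        plain = prefix[s + k] - prefix[s]
        weighted = wprefix[s + k] - wprefix[s]
        # Sum over t of (2t - (k-1)) * a[s+t], with t = i - s for absolute index i.
        cost = 2 * (weighted - s * plain) - (k - 1) * plain
        if best_cost is None or cost < best_cost:
            best_start, best_cost = s, cost
    return a[best_start:best_start + k], best_cost

print(closest_by_weights([10, 100, 300, 200, 1000, 20, 30], 3))  # ([10, 20, 30], 40)
```

Ties are resolved in favour of the leftmost window; collecting every window with the minimal cost would need a list instead of a single `best_start`.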
You can do this in O(n log n) time with a sliding window approach (O(n) if the array is already sorted).
First, suppose we've precomputed, at every index i in our array, the sum of distances from A[i] to the previous k-1 elements. The formula for that would be
(A[i] - A[i-1]) + (A[i] - A[i-2]) + ... + (A[i] - A[i-k+1]).
If i is less than k-1, we just compute the sum to the array boundary.
Suppose we also precompute, at every index i in our array, the sum of distances from A[i] to the next k-1 elements. Then we could solve the whole problem with a single pass of a sliding window.
If our sliding window is on [L, L+k-1] with closeness sum S, then the closeness sum for the interval [L+1, L+k] is just S - dist_sum_to_next[L] + dist_sum_to_prev[L+k]. The only changes in the sum of pairwise distances are removing all terms involving A[L] when it leaves our window, and adding all terms involving A[L+k] as it enters our window.
The only remaining part is how to compute, at a position i, the sum of distances between A[i] and the previous k-1 elements (the other computation is totally symmetric). If we know the distance sum at i-1, this is easy: subtract the distance from A[i-1] to A[i-k], and add in the extra distance from A[i-1] to A[i] k-1 times:
dist_sum_to_prev[i] = (dist_sum_to_prev[i - 1] - (A[i - 1] - A[i - k]))
                      + (A[i] - A[i - 1]) * (k - 1)
Python code:
import math
from typing import List
def closest_subset(nums: List[int], k: int) -> List[int]:
    """Given a list of n (poss. unsorted and non-unique) integers nums,
    returns a (sorted) list of size k that minimizes the sum of pairwise
    distances between all elements in the list.
    Runs in O(n lg n) time, uses O(n) auxiliary space.
    """
    n = len(nums)
    assert len(nums) == n
    assert 2 <= k <= n
    nums.sort()
    # Sum of pairwise distances to the next (at most) k-1 elements
    dist_sum_to_next = [0] * n
    # Sum of pairwise distances to the last (at most) k-1 elements
    dist_sum_to_prev = [0] * n
    for i in range(1, n):
        if i >= k:
            dist_sum_to_prev[i] = ((dist_sum_to_prev[i - 1] -
                                    (nums[i - 1] - nums[i - k]))
                                   + (nums[i] - nums[i - 1]) * (k - 1))
        else:
            dist_sum_to_prev[i] = (dist_sum_to_prev[i - 1]
                                   + (nums[i] - nums[i - 1]) * i)
    for i in reversed(range(n - 1)):
        if i < n - k:
            dist_sum_to_next[i] = ((dist_sum_to_next[i + 1]
                                    - (nums[i + k] - nums[i + 1]))
                                   + (nums[i + 1] - nums[i]) * (k - 1))
        else:
            dist_sum_to_next[i] = (dist_sum_to_next[i + 1]
                                   + (nums[i + 1] - nums[i]) * (n-i-1))
    best_sum = math.inf
    curr_sum = 0
    answer_right_bound = 0
    for i in range(n):
        curr_sum += dist_sum_to_prev[i]
        if i >= k:
            curr_sum -= dist_sum_to_next[i - k]
        if curr_sum < best_sum and i >= k - 1:
            best_sum = curr_sum
            answer_right_bound = i
    return nums[answer_right_bound - k + 1:answer_right_bound + 1]
