I have a model for the four possibilities of purchasing a pair of items (buying both, neither, or just one of the two) and need to optimize the (pseudo-) log-likelihood function. Part of this, of course, is the calculation/definition of the pseudo-log-likelihood function.
The following is my code, where Beta is a 2-d vector for each customer (there are U customers and U different beta vectors), X is a 2-d vector for each item (different for each of the N items) and Gamma is a symmetric matrix with a scalar value gamma(i,j) for each pair of items. And df is a dataframe of the purchases - one row for each customer and N columns for the items.
It would seem to me that all of these loops are inefficient and take up too much time, but I am not sure how to speed up this calculation and would appreciate any help improving it.
Thank you in advance!
import datetime
import numpy as np

def pseudo_likelihood(Args):
    Beta = np.reshape(Args[0:2*U], (U, 2))
    Gamma = np.reshape(Args[2*U:], (N, N))
    L = 0
    for u in range(U):
        print datetime.datetime.today(), " for user {}".format(u)
        y = df.loc[u][1:]
        beta_u = Beta[u, :]
        for l in range(N):
            print datetime.datetime.today(), " for item {}".format(l)
            for i in range(N-1):
                if i == l:
                    continue
                for j in range(i+1, N):
                    if (y[i] == y[j]):
                        if (y[i] == 1):
                            L += np.dot(beta_u, (x_vals.iloc[i, 1:] + x_vals.iloc[j, 1:])) + Gamma[i, j]  # Log of the exponent of this expression
                        else:
                            L += np.log(
                                1 - np.exp(np.dot(beta_u, (x_vals.iloc[i, 1:] + x_vals.iloc[j, 1:])) + Gamma[i, j])
                                - np.exp(np.dot(beta_u, x_vals.iloc[i, 1:])) * (
                                    1 - np.exp(np.dot(beta_u, x_vals.iloc[j, 1:])))
                                - np.exp(np.dot(beta_u, x_vals.iloc[j, 1:])) * (
                                    1 - np.exp(np.dot(beta_u, x_vals.iloc[i, 1:]))))
                    else:
                        if (y[i] == 1):
                            L += np.dot(beta_u, x_vals.iloc[i, 1:]) + np.log(1 - np.exp(np.dot(beta_u, x_vals.iloc[j, 1:])))
                        else:
                            L += np.dot(beta_u, x_vals.iloc[j, 1:]) + np.log(1 - np.exp(np.dot(beta_u, x_vals.iloc[i, 1:])))
            L -= (N-2)*np.dot(beta_u, x_vals.iloc[l, 1:])
            for k in range(N):
                if k != l:
                    L -= np.dot(beta_u, x_vals.iloc[k, 1:])
    return -L
To add/clarify - I am using this calculation to optimize and find the beta and gamma parameters that generated the data for this pseudo-likelihood function.
I am using scipy optimize.minimize with the 'Powell' method.
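For reference, here is a minimal sketch of how that call might look, assuming U, N and the data frames are defined as above (the actual invocation is not shown in the question, and the zero starting point is only a placeholder):

import numpy as np
from scipy import optimize

# Pack Beta (U x 2) and Gamma (N x N) into the flat Args vector that
# pseudo_likelihood expects; zeros are just a placeholder initial guess.
x0 = np.zeros(2 * U + N * N)
res = optimize.minimize(pseudo_likelihood, x0, method='Powell')

Beta_hat = res.x[:2 * U].reshape(U, 2)
Gamma_hat = res.x[2 * U:].reshape(N, N)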
Update for whoever is interested:
I found numpy.einsum to speed up the calculations here by over 90%.
np.einsum performs matrix/vector operations using Einstein summation notation. Recall that for two matrices A, B, the (i,k) element of the product AB is the sum over j of the products of the corresponding entries:

(AB)_ik = sum_j a_ij * b_jk
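As a quick sanity check (my example, not part of the original update), the 'ij,j' signature used below reproduces an ordinary matrix-vector product:

import numpy as np

A = np.arange(6).reshape(3, 2)   # 3 x 2 matrix
v = np.array([10.0, 1.0])        # length-2 vector

# 'ij,j' sums over the repeated index j, i.e. (Av)_i = sum_j a_ij * v_j
assert np.allclose(np.einsum('ij,j', A, v), A.dot(v))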
Using the einsum function I could calculate in advance all of the values necessary for the iterative calculation, saving precious time and hundreds, if not thousands, of unnecessary calculations.
I rewrote the code as follows:
import math
import numpy as np

def pseudo_likelihood(Args):
    Beta = np.reshape(Args[0:2*U], (U, 2))
    Gamma = np.reshape(Args[2*U:], (N, N))
    exp_gamma = np.exp(Gamma)
    L = 0
    for u in xrange(U):
        y = df.loc[u][1:]
        beta_u = Beta[u, :]
        beta_dot_x = np.einsum('ij,j', x_vals[['V1', 'V2']], beta_u)
        exp_beta_dot_x = np.exp(beta_dot_x)
        log_one_minus_exp = np.log(1 - exp_beta_dot_x)
        for l in xrange(N):
            for i in xrange(N-1):
                if i == l:
                    continue
                for j in xrange(i+1, N):
                    if (y[i] == y[j]):
                        if (y[i] == 1):
                            L += beta_dot_x[i] + beta_dot_x[j] + Gamma[i, j]  # Log of the exponent of this expression
                        else:
                            L += math.log(
                                1 - exp_beta_dot_x[i]*exp_beta_dot_x[j]*exp_gamma[i, j]
                                - exp_beta_dot_x[i] * (1 - exp_beta_dot_x[j])
                                - exp_beta_dot_x[j] * (1 - exp_beta_dot_x[i]))
                    else:
                        if (y[i] == 1):
                            L += beta_dot_x[i] + log_one_minus_exp[j]
                        else:
                            L += beta_dot_x[j] + log_one_minus_exp[i]
            L -= (N-2)*beta_dot_x[l]
            # Vectorized equivalent of the k-loop in the first version:
            # subtract beta_dot_x[k] for every k != l
            L -= np.sum(beta_dot_x) - beta_dot_x[l]
    return -L
Related
I want to write the first part of the Smith-Waterman algorithm in python with basic functions.
I found this example, but it doesn't give me what I'm looking for.
def zeros(X: int, Y: int):
    # ^ ^ incorrect type annotations, should be str
    lenX = len(X) + 1
    lenY = len(Y) + 1
    matrix = []
    for i in range(lenX):
        matrix.append([0] * lenY)
    # A more "pythonic" way of expressing the above would be:
    # matrix = [[0] * (len(Y) + 1) for _ in range(len(X) + 1)]
    def score(X, Y):
        # ^ ^ shadowing variables from outer scope; this is not a bug per se but it's considered bad practice
        if X[n] == Y[m]: return 4
        # ^ ^ variables not defined in scope
        if X[n] == '-' or Y[m] == '-': return -4
        # ^ ^ variables not defined in scope
        else: return -2
    def SmithWaterman(X, Y, score):  # this function is never called
        # ^ unnecessary function passed as parameter; score is already defined in scope
        for n in range(1, len(X) + 1):
            for m in range(1, len(Y) + 1):
                align = matrix[n-1, m-1] + (score(X[n-1], Y[m-1]))
                # ^ invalid list lookup, should be: matrix[n-1][m-1]
                indelX = matrix[n-1, m] + (score(X[n-1], Y[m]))
                # ^ out of bounds error when m == len(Y)
                indelY = matrix[n, m-1] + (score(X[n], Y[m-1]))
                # ^ out of bounds error when n == len(X)
        matrix[n, m] = max(align, indelX, indelY, 0)
        # this should be nested in the inner for-loop; m, n, indelX, and indelY are not defined in scope here
    print(matrix)

zeros("ACGT", "ACGT")
In a book I found this algorithm, but I couldn't implement it correctly.

input: sequences s and t, with |s| = n, |t| = m, score function, penalty InDel
       match +1, mismatch -2, InDel -1
M = matrix of size (n+1) * (m+1)
M[i,j] = 0
i = j = 0
Any help please
Thanks
The problems with the code you presented are well described in the comments in that piece of code.
Assuming that you want a linear gap-penalty of 2 points, and you are looking for the first phase algorithm only (so excluding the trace-back process), the code can be fixed as follows:
def score(x, y):
    return 4 if x == y else (
        -4 if '-' in (x, y) else -2
    )

def zeros(a, b):
    penalty = 2  # linear penalty (see Wikipedia)
    nextrow = [0] * (len(b) + 1)
    matrix = [nextrow]
    for valA in a:
        row, nextrow = nextrow, [0]
        for m, valB in enumerate(b):
            nextrow.append(max(
                row[m] + score(valA, valB),
                row[m+1] - penalty,
                nextrow[m] - penalty,
                0
            ))
        matrix.append(nextrow)
    return matrix
# Example run:
result = zeros("ACGT", "AC-GT")
print(result)
An implementation of the book's algorithm could start like this:

M = []
for i in range(n):
    M.append([])
    for j in range(m):
        first = M[i - 1][j - 1] + score(s[i], t[j])
        second = M[i - 1][j] + penal
        third = M[i][j - 1] + penal
        M[i].append(max(first, second, third, 0))
But you will have to fix the edge cases (out of range) and add some default values.
The following code prints a Pythagorean triplet whose sum equals the input, but the problem is that it takes a long time to answer for large numbers like 90,000.
What can I do to optimize the following code?
1 ≤ n ≤ 90 000
def pythagoreanTriplet(n):
    # Considering triplets in sorted order. The value of the first
    # element in a sorted triplet can be at most n/3.
    for i in range(1, int(n / 3) + 1):
        # The value of the second element must be less than or equal to n/2
        for j in range(i + 1, int(n / 2) + 1):
            k = n - i - j
            if (i * i + j * j == k * k):
                print(i, ", ", j, ", ", k, sep="")
                return
    print("Impossible")

# Driver Code
vorodi = int(input())
pythagoreanTriplet(vorodi)
Your source code does a brute force search for a solution so it's slow.
Faster Code
def solve_pythagorean_triplets(n):
    " Solves for triplets whose sum equals n "
    solutions = []
    for a in range(1, n):
        denom = 2*(n - a)
        num = 2*a**2 + n**2 - 2*n*a
        if denom > 0 and num % denom == 0:
            c = num // denom
            b = n - a - c
            if b > a:
                solutions.append((a, b, c))
    return solutions
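For instance (my example, not from the original answer), a perimeter of 12 yields the classic triple:

>>> solve_pythagorean_triplets(12)
[(3, 4, 5)]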
OP code
Modified OP code so it returns all solutions rather than printing the first one found, in order to compare performance:
def pythagoreanTriplet(n):
    # Considering triplets in sorted order. The value of the first
    # element in a sorted triplet can be at most n/3.
    results = []
    for i in range(1, int(n / 3) + 1):
        # The value of the second element must be less than or equal to n/2
        for j in range(i + 1, int(n / 2) + 1):
            k = n - i - j
            if (i * i + j * j == k * k):
                results.append((i, j, k))
    return results
Timing
n        pythagoreanTriplet (OP code)        solve_pythagorean_triplets (new)
900      0.084 seconds                       0.039 seconds
5000     3.130 seconds                       0.012 seconds
90000    Timed out after several minutes     0.430 seconds
Explanation
Function solve_pythagorean_triplets is an O(n) algorithm that works as follows.
Searching for:
a^2 + b^2 = c^2 (triplet)
a + b + c = n (sum equals input)
Solve by searching over a (i.e. a is fixed within an iteration). With a fixed, we have two equations and two unknowns (b, c):
b + c = n - a
c^2 - b^2 = a^2
Solution is:
denom = 2*(n-a)
num = 2*a**2 + n**2 - 2*n*a
if denom > 0 and num % denom == 0:
c = num // denom
b = n - a - c
if b > a:
(a, b, c) # is a solution
Iterate a over range(1, n) to get the different solutions.
Edit June 2022 by #AbhijitSarkar:
For those who like to see the missing steps:
c^2 - b^2 = a^2
b + c = n - a
=> b = n - a - c
c^2 - (n - a - c)^2 = a^2
=> c^2 - (n - a - c) * (n - a - c) = a^2
=> c^2 - n(n - a - c) + a(n - a - c) + c(n - a - c) = a^2
=> c^2 - n^2 + an + nc + an - a^2 - ac + cn - ac - c^2 = a^2
=> -n^2 + 2an + 2nc - a^2 - 2ac = a^2
=> -n^2 + 2an + 2nc - 2a^2 - 2ac = 0
=> 2c(n - a) = n^2 - 2an + 2a^2
=> c = (n^2 - 2an + 2a^2) / (2(n - a))
DarrylG's answer is correct, and I've added the missing steps to it as well, but there's another solution that's faster than iterating from [1, n). Let me explain it, but I'll leave the code up to the reader.
We use Euclid's formula of generating a tuple.
a = m^2 - n^2, b = 2mn, c = m^2 + n^2, where m > n > 0 ---(i)
a + b + c = P ---(ii)
Combining equations (i) and (ii), we have:
2m^2 + 2mn = P ---(iii)
Since m > n > 0, 1 <= n <= m - 1.
Putting n=1 in equation (iii), we have:
2m^2 + 2m - P = 0, ax^2 + bx + c = 0, a=2, b=2, c=-P
m = (-b +- sqrt(b^2 - 4ac)) / 2a
=> (-2 +- sqrt(4 + 8P)) / 4
=> (-1 +- sqrt(1 + 2P)) / 2
Since m > 0, sqrt(b^2 - 4ac) > -b, the only solution is
(-1 + sqrt(1 + 2P)) / 2 ---(iv)
Putting n=m-1 in equation (iii), we have:
2m^2 + 2m(m - 1) - P = 0
=> 4m^2 - 2m - P = 0, ax^2 + bx + c = 0, a=4, b=-2, c=-P
m = (-b +- sqrt(b^2 - 4ac)) / 2a
=> (2 +- sqrt(4 + 16P)) / 8
=> (1 +- sqrt(1 + 4P)) / 4
Since m > 0, the only solution is
(1 + sqrt(1 + 4P)) / 4 ---(v)
From equation (iii), m^2 + mn = P/2; since P/2 is constant,
when n is the smallest, m must be the largest, and vice versa.
Thus:
(1 + sqrt(1 + 4P)) / 4 <= m <= (-1 + sqrt(1 + 2P)) / 2 ---(vi)
Solving equation (iii) for n, we have:
n = (P - 2m^2) / 2m ---(vii)
We iterate for m within the bounds given by the inequality (vi)
and check when the corresponding n given by equation (vii) is
an integer.
Despite generating all primitive triples, Euclid's formula does not
produce all triples - for example, (9, 12, 15) cannot be generated using
integer m and n. This can be remedied by inserting an additional
parameter k to the formula. The following will generate all Pythagorean
triples uniquely.
a = k(m^2 - n^2), b = 2kmn, c = k(m^2 + n^2), for k >= 1.
Thus, we iterate over integer values of P/k, stopping when P/k < 12, the
lowest possible perimeter, corresponding to the triple (3, 4, 5).
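The code was left to the reader; here is a minimal sketch of the method described above (my code, assuming Python 3.8+ for math.isqrt; the loose integer square-root bounds are harmless because every candidate is verified by the 0 < n < m check):

import math

def euclid_triplets(P):
    # Sketch of the Euclid's-formula search described above.
    solutions = set()
    # Scale factor k: the scaled perimeter P/k must be an integer >= 12.
    for k in range(1, P // 12 + 1):
        if P % k:
            continue
        p = P // k
        # Bounds on m, from (iv) and (v) above.
        m_lo = (1 + math.isqrt(1 + 4 * p) + 3) // 4   # ceil((1 + sqrt(1 + 4P)) / 4)
        m_hi = (-1 + math.isqrt(1 + 2 * p)) // 2      # floor((-1 + sqrt(1 + 2P)) / 2)
        for m in range(m_lo, m_hi + 1):
            num = p - 2 * m * m                        # n = (P - 2m^2) / (2m), equation (vii)
            if num > 0 and num % (2 * m) == 0:
                n = num // (2 * m)
                if 0 < n < m:
                    a, b, c = m*m - n*n, 2*m*n, m*m + n*n
                    solutions.add(tuple(sorted((k*a, k*b, k*c))))
    return sorted(solutions)

print(euclid_triplets(12))   # [(3, 4, 5)]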
Yo! I don't know if you still need the answer, but hopefully this can help.
n = int(input())
ans = [(a, b, c)
       for a in range(1, n)
       for b in range(a, n)
       for c in range(b, n)
       if (a**2 + b**2 == c**2 and a + b + c == n)]
if ans:
    print(ans[0][0], ans[0][1], ans[0][2])
else:
    print("Impossible")
I am coding in Python, using numpy. I want to optimize a formula that looks like that; I used a picture for the sake of readability.
In that example, the times t come from different lists, indicated by the superscript. The corresponding vector here is T_t, which is a list of lists. The quantity computed (recoverable from the code below) is the sum, over the indices i with T_t[n][i] < T_t[m][k-1], of (T_t[m][k-1] - T_t[n][i]) * exp(-BETA[m,n] * (T_t[m][k-1] - T_t[n][i])).
Here is my original code:
def first_version(m, n, k, T_t, BETA):
    if k == 1:
        return 0
    ans = 0
    for i in range(len(T_t[n])):
        if T_t[n][i] < T_t[m][k - 1]:
            ans += (T_t[m][k - 1] - T_t[n][i]) * np.exp(-BETA[m, n] * (T_t[m][k - 1] - T_t[n][i]))
        else:
            break
    return ans
The break at the end allows me to save some time. Then I had that brilliant idea of using the numpy library to improve performance:
def second_version(m, n, k, T_t, BETA):
    if k == 1:
        return 0
    the_times = np.maximum(T_t[m][k - 1] - np.array(T_t[n]), 0)
    ans = sum(the_times * np.exp(-BETA[m, n] * the_times))
    return ans
For the sake of comparison, the second algorithm runs 100x faster. Is it possible to do better ? In particular, I regret the fact that numpy computes the maximum over the whole vector when probably half of it will be 0 at the end.
Do you have any idea how to improve those bits of code ?
I forgot a sum in code nr 2. That slows down the code and makes it only 20 times faster.
I have 2 main suggestions:
Using np.sum() instead of sum() about triples the speed of second_version
Using numba.jit increases the speed again about 8x. (Actually you can jit-compile either version and end up with about the same speed)
Full code example:
import numpy as np
import numba
import timeit

def first_version(m, n, k, T_t, BETA):
    if k == 1:
        return 0
    ans = 0
    for i in range(len(T_t[n])):
        if T_t[n][i] < T_t[m][k - 1]:
            ans += (T_t[m][k - 1] - T_t[n][i]) * np.exp(-BETA[m, n] * (T_t[m][k - 1] - T_t[n][i]))
        else:
            break
    return ans

def second_version(m, n, k, T_t, BETA):
    if k == 1:
        return 0
    the_times = np.maximum(T_t[m][k - 1] - np.array(T_t[n]), 0)
    ans = np.sum(the_times * np.exp(-BETA[m, n] * the_times))
    return ans

def jit_version(m, n, k, T_t, BETA):
    # wrapper makes it so that numba doesn't have to deal with
    # the list-of-arrays data type
    return jit_version_core(k, T_t[m], T_t[n], BETA[m, n])

@numba.jit(nopython=True)
def jit_version_core(k, t1, t2, b):
    if k == 1:
        return 0
    ans = 0
    for i in range(len(t2)):
        if t2[i] < t1[k - 1]:
            ans += (t1[k - 1] - t2[i]) * np.exp(-b * (t1[k - 1] - t2[i]))
        else:
            break
    return ans

N = 10000
t1 = np.cumsum(np.random.random(size=N))
t2 = np.cumsum(np.random.random(size=N))
beta = np.random.random(size=(2, 2))

for fn in ['first_version', 'second_version', 'jit_version']:
    print("------", fn)
    v = globals()[fn](0, 1, len(t1), [t1, t2], beta)
    t = timeit.timeit('%s(0, 1, len(t1), [t1, t2], beta)' % fn, number=100, globals=globals())
    print("output:", v, "time:", t)
And the output:
------ first_version
output: 3.302938986817431 time: 2.900316455983557
------ second_version
output: 3.3029389868174306 time: 0.12064526398899034
------ jit_version
output: 3.302938986817431 time: 0.013476221996825188
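To address the question's worry that np.maximum scans the whole vector: since the time lists are cumulative sums and therefore sorted, np.searchsorted can find the cutoff index first, so the exponential is only evaluated where the terms are nonzero. A minimal sketch (my suggestion, not part of the original answer):

import numpy as np

def searchsorted_version(m, n, k, T_t, BETA):
    if k == 1:
        return 0
    t = T_t[m][k - 1]
    # First index with T_t[n][i] >= t; only earlier entries contribute.
    cut = np.searchsorted(T_t[n], t)
    the_times = t - np.asarray(T_t[n][:cut])
    return np.sum(the_times * np.exp(-BETA[m, n] * the_times))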
I've been assigned a project in my computing class to do a report on some area of mathematics in LaTeX, using Python 2.7 code - I chose the Fibonacci sequence.
As part of my project I wanted to include a plot of the Fibonacci 'spiral', which is actually comprised of a series of quarter-circles of increasing radii. To that end, I've tried to define a function giving a loop that returns the centres of these quarter-circles so I can create a plot. Using pen and paper I found the centres of each quarter-circle and noticed that with each new quarter-circle there is an exchange of coordinates: if n is even, the x-coordinate of the previous centre remains the x-coordinate of the nth centre; similarly, when n is odd, the y-coordinate remains the same.
My problem arises with the other coordinate. They work on an alternating pattern of adding or subtracting the (n-2)th Fibonacci number to the y-coordinate (for even n) or x-coordinate (for odd n) of the previous centre.
I've created the following loop in SageMathCloud, but I think I've deduced that my counters aren't incrementing when I wanted them to:
def centrecoords(n):
    k = 0
    l = 1
    if fib(n) == 1:
        return tuple((0, -1))
    elif n % 2 == 0 and k % 2 == 0:
        return tuple((centrecoords(n-1)[0], centrecoords(n-1)[1] + ((-1) ** k) * fib(n - 2)))
        k += 1  # unreachable: this line sits after the return
    elif n % 2 == 0:
        return tuple((centrecoords(n-1)[0], centrecoords(n-1)[1] + ((-1) ** k) * fib(n - 2)))
    elif n % 2 != 0 and l % 2 == 0:
        return tuple((centrecoords(n-1)[0] + ((-1) ** l) * fib(n - 2), centrecoords(n-1)[1]))
        l += 1  # unreachable: this line sits after the return
    else:
        return tuple((centrecoords(n-1)[0] + ((-1) ** l) * fib(n - 2), centrecoords(n-1)[1]))

cen_coords = []
for i in range(0, 21):
    cen_coords.append(centrecoords(i))

cen_coords
Any help in making the k counter increment with its if statement only, and the same with the l counter would be greatly appreciated.
Your problem is that k and l are local variables. As such they are lost every time the function exits, and re-start at zero and one respectively when it is called again (yes, even when it's called from itself).
Nick's code aims to store a single instance each of k and l in the top-level function, sharing them with the recursive calls.
Another reasonable approach might be to rewrite your recursion as a loop, and yield the sequence. This makes it trivial to keep the state of k and l, as your locals are preserved.
Or, you could re-write your function as a class method, and make k and l instance variables. This behaves similarly, with the instance storing your intermediate state between calls to centrecoords.
Apart from all of these, your code looks like it requires each call to centrecoords to receive the next value of n. So, even if you fix the state problem, this is a poor design.
I'd suggest going the generator route, and taking a single argument, the maximum value of n. Then you can iterate over range(n), yielding each result in turn. Note also that your only recursive call is for n-1, which is just your previous iteration, so you can simply remember it.
Quick demo: I haven't tested this, or checked the corner cases ...
def fib(n):
    if n < 2:
        return 1
    return fib(n-1) + fib(n-2)

def centrecoords(max_n):
    # initial values
    k = 0
    l = 1
    result = (0, -1)
    # note fib(0) == fib(1) == 1
    for n in range(2, max_n):
        if n % 2 == 0:
            result = (result[0], result[1] + ((-1) ** k) * fib(n - 2))
            yield result
            if k % 2 == 0:
                k += 1
        else:
            result = (result[0] + ((-1) ** l) * fib(n - 2), result[1])
            yield result
            if l % 2 == 0:
                l += 1

cen_coords = list(centrecoords(21))
Expanding on my comment: your code could look something like the version below. Please note that you might need to adjust the starting values of k and l to -1 and 0 respectively, because k and l are now incremented before the recursive calls (as opposed to your code, which implied that the recursion is called first and only then are k and l increased).
I also deleted tuple; it is unnecessary in Python and hard to read. To create a tuple, use the comma syntax, e.g.: 1, 2.
Also, n == 0 (fib(n) == 0) should be considered a special case, or your program will enter infinite recursion and crash when centrecoords is called with n=0.
I have no account on SageMathCloud to test it, but it should at least fix the counter increments.
def centrecoords(n, k=0, l=1):
    if n == 0:
        # pure guess and most likely incorrect, but n == 0 (fib(n) == 0) must be handled separately
        return 0, 0
    if fib(n) == 1:
        return 0, -1
    elif n % 2 == 0 and k % 2 == 0:
        k += 1
        return centrecoords(n-1, k, l)[0], centrecoords(n-1, k, l)[1] + ((-1) ** k) * fib(n - 2)
    elif n % 2 == 0:
        return centrecoords(n-1, k, l)[0], centrecoords(n-1, k, l)[1] + ((-1) ** k) * fib(n - 2)
    elif n % 2 != 0 and l % 2 == 0:
        l += 1
        return centrecoords(n-1, k, l)[0] + ((-1) ** l) * fib(n - 2), centrecoords(n-1, k, l)[1]
    else:
        return centrecoords(n-1, k, l)[0] + ((-1) ** l) * fib(n - 2), centrecoords(n-1, k, l)[1]

cen_coords = []
for i in range(0, 21):
    cen_coords.append(centrecoords(i))

cen_coords
Given an array of integers size N, how can you efficiently find a subset of size K with elements that are closest to each other?
Let the closeness of a subset (x1, x2, ..., xk) be defined as the sum of pairwise distances between its elements:

closeness = sum_{i<j} |x_i - x_j|

Constraints:

2 <= N <= 10^5
2 <= K <= N
The array may contain duplicates and is not guaranteed to be sorted.
My brute force solution is very slow for large N, and it doesn't check if there's more than 1 solution:
import sys

N = input()
K = input()
assert 2 <= N <= 10**5
assert 2 <= K <= N
a = []
for i in xrange(0, N):
    a.append(input())
a.sort()

minimum = sys.maxint
startindex = 0
for i in xrange(0, N-K+1):
    last = i + K
    tmp = 0
    for j in xrange(i, last):
        for l in xrange(j+1, last):
            tmp += abs(a[j]-a[l])
            if (tmp > minimum):
                break
    if (tmp < minimum):
        minimum = tmp
        startindex = i  # end index = startindex + K?
Examples:
N = 7
K = 3
array = [10,100,300,200,1000,20,30]
result = [10,20,30]
N = 10
K = 4
array = [1,2,3,4,10,20,30,40,100,200]
result = [1,2,3,4]
Your current solution is O(NK^2) (assuming K > log N). With some analysis, I believe you can reduce this to O(NK).
The closest set of size K will consist of elements that are adjacent in the sorted list (swapping a chosen element for a skipped element lying between two chosen ones can never increase any pairwise distance). You essentially have to first sort the array, so the subsequent analysis will assume that each sequence of K numbers is sorted, which allows the double sum to be simplified.
Assuming that the array is sorted such that x[j] >= x[i] when j > i, we can rewrite your closeness metric to eliminate the absolute value and express it as a double summation with simple bounds:

sum_{i=1..K-1} sum_{j=i+1..K} (x[j] - x[i])

Notice that we can rewrite the inner distance between x[i] and x[j] as a third summation over adjacent differences:

x[j] - x[i] = sum_{l=i..j-1} d[l]

where I've used d[l] = x[l+1] - x[l] to simplify the notation going forward.
Notice that d[l] is the distance between each adjacent element in the list. Look at the structure of the inner two summations for a fixed i:
j=i+1    d[i]
j=i+2    d[i] + d[i+1]
j=i+3    d[i] + d[i+1] + d[i+2]
...
j=K      d[i] + d[i+1] + d[i+2] + ... + d[K-1]

Notice the triangular structure of the inner two summations. This allows us to rewrite them as a single summation in terms of the distances of adjacent terms:

total: (K-i)*d[i] + (K-i-1)*d[i+1] + ... + 2*d[K-2] + 1*d[K-1]
which reduces the total sum to:

sum_{i=1..K-1} sum_{l=i..K-1} (K-l) * d[l]

Now we can look at the structure of this double summation:
i=1 (K-1)*d[1] + (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
i=2 (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
i=3 (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
...
i=K-2 2*d[K-2] + d[K-1]
i=K-1 d[K-1]
Again, notice the triangular pattern. The total sum then becomes:
1*(K-1)*d[1] + 2*(K-2)*d[2] + 3*(K-3)*d[3] + ... + (K-2)*2*d[K-2]
+ (K-1)*1*d[K-1]
Or, written as a single summation:

sum_{l=1..K-1} l * (K-l) * d[l]

This compact single summation of adjacent differences is the basis for a more efficient algorithm:
Sort the array, order O(N log N)
Compute the differences of each adjacent element, order O(N)
Iterate over each of the N-K+1 windows of K-1 consecutive differences and calculate the above sum, order O(NK)
Note that the second and third step could be combined, although with Python your mileage may vary.
The code:
def closeness(diff, K):
    acc = 0.0
    for (i, v) in enumerate(diff):
        acc += (i+1) * (K-(i+1)) * v
    return acc

def closest(a, K):
    a.sort()
    N = len(a)
    diff = [a[i+1] - a[i] for i in xrange(N-1)]
    min_ind = 0
    min_val = closeness(diff[0:K-1], K)
    for ind in xrange(1, N-K+1):
        cl = closeness(diff[ind:ind+K-1], K)
        if cl < min_val:
            min_ind = ind
            min_val = cl
    return a[min_ind:min_ind+K]
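For the first example from the question, this returns the expected subset (my check, computed by hand):

>>> closest([10, 100, 300, 200, 1000, 20, 30], 3)
[10, 20, 30]

Here the sorted adjacent differences are [10, 10, 70, 100, 100, 700], and the window [10, 10] gives the minimal weighted sum 1*2*10 + 2*1*10 = 40, which matches the pairwise sum |10-20| + |10-30| + |20-30| = 40.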
itertools to the rescue?
from itertools import combinations

def closest_elements(iterable, K):
    # use a list, not a set: the array may contain duplicates
    items = list(iterable)
    assert 2 <= K <= len(items) <= 10**5
    d = {}
    for x in combinations(items, K):
        d[x] = sum(abs(p - q) for p, q in combinations(x, 2))
    return min(d, key=d.get)
>>> a = [10,100,300,200,1000,20,30]
>>> b = [1,2,3,4,10,20,30,40,100,200]
>>> print closest_elements(a, 3); closest_elements(b, 4)
(10, 20, 30) (1, 2, 3, 4)
This procedure can be done in O(N*K) if A is sorted. If A is not sorted, then the time will be bounded by the sorting procedure.
This is based on 2 facts (relevant only when A is ordered):
The closest subsets will always be subsequent.
When calculating the closeness of K subsequent elements, the sum of distances can be calculated as the sum over each two subsequent elements of their distance times i*(K-i), where i is 1,...,K-1.
When iterating through the sorted array it is redundant to recompute the entire sum; we can instead use the closeness of the previous subset to calculate the closeness of the current one in O(1). (The pseudo-code below does not use this and recomputes each window in O(K) instead.)
Here's the pseudo-code
List<pair> FindClosestSubsets(int[] A, int K)
{
    List<pair> minList = new List<pair>;
    int minVal = infinity;
    int tempSum;
    int N = A.length;
    for (int i = K - 1; i < N; i++)
    {
        tempSum = 0;
        for (int j = i - K + 2; j <= i; j++)
        {
            int w = j - (i - K + 1);  // position of the gap inside the window, 1..K-1
            tempSum += w * (K - w) * (A[j] - A[j-1]);
        }
        if (tempSum < minVal)
        {
            minVal = tempSum;
            minList.clear();
            minList.add(new pair(i - K + 1, i));
        }
        else if (tempSum == minVal)
            minList.add(new pair(i - K + 1, i));
    }
    return minList;
}
This function will return a list of pairs of indexes representing the optimal solutions (the starting and ending index of each solution), it was implied in the question that you want to return all solutions of the minimal value.
try the following:
N = input()
K = input()
assert 2 <= N <= 10**5
assert 2 <= K <= N
a = some_unsorted_list
a.sort()

cur_diff = sum([abs(a[i] - a[i + 1]) for i in range(K - 1)])
min_diff = cur_diff
min_last_idx = K - 1
for last_idx in range(K, N):
    cur_diff = cur_diff - \
               abs(a[last_idx - K] - a[last_idx - K + 1]) + \
               abs(a[last_idx] - a[last_idx - 1])
    if min_diff > cur_diff:
        min_diff = cur_diff
        min_last_idx = last_idx
From min_last_idx, you can calculate min_first_idx. I use range to preserve the order of indices; if this is Python 2.7, it will take linearly more RAM. This is the same algorithm that you use, but slightly more efficient (a smaller constant in the complexity), as it does less than summing everything.
After sorting, we can be sure that, if x1, x2, ..., xk are the solution, then they are contiguous elements, right?
So:
take the intervals between consecutive numbers,
sum these intervals to get the interval spanned by k numbers,
choose the smallest of them.
My initial solution was to look through every K-element window, multiply each element by m and take the sum over that range, where m is initialized to -(K-1) and incremented by 2 at each step, then take the minimum sum over the entire list. So for a window of size 3, m starts at -2 and the values over the range are -2, 0, 2. This is because I observed a property: each element in the K-window adds a certain weight to the sum. For example, if the elements are [10, 20, 30] the sum is (30-10) + (30-20) + (20-10); breaking down the expression we have 2*30 + 0*20 + (-2)*10. Each window sum can be computed in O(K) time, so the entire operation is O(NK). However it turns out that this solution is not optimal, and there are certain edge cases where this algorithm fails. I have yet to figure out those cases, but I am sharing the solution anyway in case anyone can figure out something useful from it.
for (i = 0; i <= n - k; ++i)
{
    diff = 0;
    l = -(k - 1);
    for (j = i; j < i + k; ++j)
    {
        diff += a[j] * l;
        if (min < diff)
            break;
        l += 2;
    }
    if (j == i + k && diff > 0)
        min = diff;
}
You can do this is O(n log n) time with a sliding window approach (O(n) if the array is already sorted).
First, suppose we've precomputed, at every index i in our array, the sum of distances from A[i] to the previous k-1 elements. The formula for that would be
(A[i] - A[i-1]) + (A[i] - A[i-2]) + ... + (A[i] - A[i-k+1]).
If i is less than k-1, we just compute the sum to the array boundary.
Suppose we also precompute, at every index i in our array, the sum of distances from A[i] to the next k-1 elements. Then we could solve the whole problem with a single pass of a sliding window.
If our sliding window is on [L, L+k-1] with closeness sum S, then the closeness sum for the interval [L+1, L+k] is just S - dist_sum_to_next[L] + dist_sum_to_prev[L+k]. The only changes in the sum of pairwise distances are removing all terms involving A[L] when it leaves our window, and adding all terms involving A[L+k] as it enters our window.
The only remaining part is how to compute, at a position i, the sum of distances between A[i] and the previous k-1 elements (the other computation is totally symmetric). If we know the distance sum at i-1, this is easy: subtract the distance from A[i-1] to A[i-k], and add in the extra distance from A[i-1] to A[i] k-1 times
dist_sum_to_prev[i] = (dist_sum_to_prev[i - 1] - (A[i - 1] - A[i - k]))
                      + (A[i] - A[i - 1]) * (k - 1)
Python code:
import math
from typing import List

def closest_subset(nums: List[int], k: int) -> List[int]:
    """Given a list of n (poss. unsorted and non-unique) integers nums,
    returns a (sorted) list of size k that minimizes the sum of pairwise
    distances between all elements in the list.

    Runs in O(n lg n) time, uses O(n) auxiliary space.
    """
    n = len(nums)
    assert 2 <= k <= n
    nums.sort()

    # Sum of pairwise distances to the next (at most) k-1 elements
    dist_sum_to_next = [0] * n
    # Sum of pairwise distances to the last (at most) k-1 elements
    dist_sum_to_prev = [0] * n

    for i in range(1, n):
        if i >= k:
            dist_sum_to_prev[i] = ((dist_sum_to_prev[i - 1] -
                                    (nums[i - 1] - nums[i - k]))
                                   + (nums[i] - nums[i - 1]) * (k - 1))
        else:
            dist_sum_to_prev[i] = (dist_sum_to_prev[i - 1]
                                   + (nums[i] - nums[i - 1]) * i)

    for i in reversed(range(n - 1)):
        if i < n - k:
            dist_sum_to_next[i] = ((dist_sum_to_next[i + 1]
                                    - (nums[i + k] - nums[i + 1]))
                                   + (nums[i + 1] - nums[i]) * (k - 1))
        else:
            dist_sum_to_next[i] = (dist_sum_to_next[i + 1]
                                   + (nums[i + 1] - nums[i]) * (n - i - 1))

    best_sum = math.inf
    curr_sum = 0
    answer_right_bound = 0
    for i in range(n):
        curr_sum += dist_sum_to_prev[i]
        if i >= k:
            curr_sum -= dist_sum_to_next[i - k]
        if curr_sum < best_sum and i >= k - 1:
            best_sum = curr_sum
            answer_right_bound = i

    return nums[answer_right_bound - k + 1:answer_right_bound + 1]