Suppose we have data a₁, ..., aₙ, where n is an even integer and each aᵢ ∈ ℝ. Define the distance between two elements as dis(aᵢ, aⱼ) = |aᵢ − aⱼ|. The program should pack the input data into pairs, so that each element aᵢ appears exactly once in the output, and print the list of pairs sorted by distance in ascending order.
For example, given the input [1, 0.4, 3, 1.1] the output should be [(1, 1.1), (0.4, 3)].
A naive brute-force method is to compute all C(n,2) pairs and sort them by distance.
def not_in_list_of_pair(i, ls):
    return i not in [p[0] for p in ls] + [p[1] for p in ls]

def calc(ls):
    ls = sorted(ls)
    d = {}
    for idx1, i in enumerate(ls[:-1]):
        for j in ls[idx1 + 1:]:
            d[(i, j)] = j - i
    # 2nd part: extract pairs in order of increasing distance
    res = []
    for pair in sorted(d, key=lambda k: d[k]):
        i, j = pair
        if not_in_list_of_pair(i, res) and not_in_list_of_pair(j, res):
            res.append(pair)
    return res
# another example
ls = [1, 0.1, 2, 2.4, 3, 4, 1.5]
assert calc(ls) == [(2, 2.4), (1, 1.5), (3, 4)]
But this naive method needs O(n²) work just to build all the pairs, and the 2nd part (extracting the minimum-distance pairs) is also slow. Therefore I am looking for a more efficient method to solve this problem. Thanks!
I have to say that your description of the problem is not clear, and the complexity in the description is not correct: you have to calculate the distance of every pair (which is O(n²)), and after that you sort all those distances (which is O(n² log(n²))).
For this problem, you are basically finding the two integers with the smallest distance, picking those two out, and repeating the same process on the remaining integers.
One naive solution: suppose the integers are sorted and we only want the single pair with the smallest distance. Since in a sorted array the closest pair must be adjacent, we just need to calculate the distance of each two adjacent integers (i.e., between ls[0] and ls[1], between ls[1] and ls[2], ..., between ls[n - 2] and ls[n - 1]) and find the smallest one. After we output that pair, we remove the two selected integers; the remaining integers are still sorted, so finding the next closest pair is the same problem again.
The naive solution is still expensive in two aspects: (1) we need to recalculate the distance of every adjacent pair in each round; (2) we need to remove two integers from a sorted array and keep the array sorted.
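To make those two costs concrete, here is a minimal sketch of the naive approach (calc_naive is just an illustrative name, not code from the question):

def calc_naive(ls):
    ls = sorted(ls)
    result = []
    while len(ls) >= 2:
        # cost (1): rescan every adjacent gap, O(n) per round
        i = min(range(len(ls) - 1), key=lambda i: ls[i + 1] - ls[i])
        result.append((ls[i], ls[i + 1]))
        # cost (2): rebuild the sorted list without the chosen pair, O(n) per round
        ls = ls[:i] + ls[i + 2:]
    return result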
To address (1), we in fact don't have to recalculate all the distances each time. E.g., suppose we have 6 integers and we have calculated dist(0, 1), dist(1, 2), dist(2, 3), dist(3, 4), dist(4, 5). We find that the 2nd and the 3rd integers are the closest ones, so we output and remove them. For the next round we need dist(0, 1), dist(1, 4), dist(4, 5): we only have to remove dist(1, 2) and dist(3, 4), which have become useless, and add one new distance dist(1, 4), while dist(0, 1) and dist(4, 5) are unchanged. We can maintain a btree to achieve this.
To address (2), the best data structure for removing items from the middle in O(1) is a doubly linked list. But we are working with an array, and we may not want to convert it into a linked list. One trick is to use index arrays to mimic a doubly linked list.
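For instance, the unlink step then only touches an element's two neighbours (a hypothetical standalone fragment; the full code below inlines this logic):

def unlink(left, right, i):
    # splice element i out of the mimicked doubly linked list, O(1)
    if left[i] != -1:               # i has a left neighbour
        right[left[i]] = right[i]
    if right[i] != len(right):      # i has a right neighbour
        left[right[i]] = left[i]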
Here is an example.
Update 1: I found that OrderedDict does not pop the minimal item each time, and I did not find any data structure in Python that works like a btree. So I have to use a heap, where I cannot delete the useless distances but I can identify and ignore them. Sorry for the mistake.
Update 2: Added an else branch in the while loop, i.e., we should not change the doubly linked list when we see a useless item.
Update 3: Just realized that the heap will hold no more than n items at any iteration of the while loop, so the complexity is roughly O(n log n), with n being the number of integers.
from heapq import heappush, heappop

def calc(ls):
    ls = sorted(ls)  # O(n log n)
    n = len(ls)
    # mimic a doubly linked list with two index arrays
    left = [i - 1 for i in range(n)]
    right = [i + 1 for i in range(n)]
    appeared = [False for i in range(n)]
    heap = []
    for i in range(0, n - 1):
        # distance of adjacent integers, and their indices
        heappush(heap, (ls[i + 1] - ls[i], i, i + 1))
    # roughly O(n log n), because the heap holds at most n items at any iteration
    result = []
    while len(heap) != 0:
        minimal = heappop(heap)
        a, b = minimal[1:3]
        # skip if either a or b already appeared in an output pair
        if not appeared[a] and not appeared[b]:
            result.append((ls[a], ls[b]))
            appeared[a] = True
            appeared[b] = True
        else:
            continue  # this is important
        # unlink a and b: bridge their outer neighbours
        if left[a] != -1:
            right[left[a]] = right[b]
        if right[b] != n:
            left[right[b]] = left[a]
        # the two outer neighbours are now adjacent; push their distance
        if left[a] != -1 and right[b] != n:
            heappush(heap, (ls[right[b]] - ls[left[a]], left[a], right[b]))
    return result

ls = [1, 0.1, 2, 2.4, 3, 4, 1.5]
print(calc(ls))
With the following output:
[(2, 2.4), (1, 1.5), (3, 4)]
Note: The number of input integers is 7, which is NOT even; the leftover element (0.1 here) simply never gets paired.
I am not very familiar with Python, so I may not be using the best data structure in the above code snippet.
I have lists of data points, which I check to see whether they are above a certain threshold.
I can calculate the percentage of total points above the threshold, but I also need the indices and values of all points above the threshold, e.g.
points_above_threshold = [1,1,1,0,0,0,1,1]
1 is yes, 0 is no
I need a function which returns points in the format:
[line_points, (start_index, end_index)]
e.g. the output of points_above_threshold would be
[3,(0,2)],[2,(6,7)]
Your question is lacking some detail about the format of the data you're working with. A good starting point is to specify precisely the expected input and output for your function.
For example, if your data is a list of numbers (floats) like this:
[1.56, 2.45, 8.43, ... ]
your threshold is a single floating point number, and your output is expected to be a list of tuples (index, data_point) like this:
[(1, 2.45), (2, 8.43), ... ]
Then you can write a function that looks something like this:
def get_points_above_threshold(data_list, threshold):
    output = []
    for idx, point in enumerate(data_list):
        if point > threshold:
            output.append((idx, point))
    return output
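Applied to the sample data above with a threshold of, say, 2.0 (hypothetical values, just to show the call):

data = [1.56, 2.45, 8.43]
print(get_points_above_threshold(data, 2.0))  # [(1, 2.45), (2, 8.43)]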
I'll attempt to answer how to implement the points_above_threshold function you describe. We can alter the above function slightly with a tracking system to calculate the index ranges of values that are above the threshold like this:
def compute_ranges(values, threshold):
    start_range = None
    ranges = []  # tuples (start_idx, end_idx), inclusive
    for idx, value in enumerate(values):
        if value <= threshold:
            # This either ends an "active" range, or does nothing if there isn't one.
            if start_range is None:
                continue
            # End the current range, append it to ranges, and reset.
            ranges.append((start_range, idx - 1))
            start_range = None
        else:
            # Start an "active" range, or continue one that already exists.
            if start_range is None:
                start_range = idx
    if start_range is not None:
        # A range can still be active at the end of the list; append it too.
        ranges.append((start_range, len(values) - 1))
    # Final conversion that prepends the length of each range to the output.
    final = [(r[1] - r[0] + 1, r) for r in ranges]
    return final
If we apply this function to a list of numbers with a given threshold, it will output the ranges in the way you describe above. For example, if the input list is the simple example
[1,1,1,0,0,0,1,1]
and the threshold is, say, 0.5, then the output is
[(3, (0, 2)), (2, (6, 7))]
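Reproducing that as a quick check (assuming the compute_ranges definition above):

print(compute_ranges([1, 1, 1, 0, 0, 0, 1, 1], 0.5))
# [(3, (0, 2)), (2, (6, 7))]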
Using enumerate and pairwise iteration, we can achieve what you want.
# enumerate helps us to isolate the indexes of 1's
points_above_threshold = [1, 1, 1, 0, 0, 0, 1, 1]
id_ = [i for i, e in enumerate(points_above_threshold) if e == 1]  # list comprehension
print(id_)
[0, 1, 2, 6, 7] # all indexes of 1's
# pairwise iteration helps us find the
# sequences of indexes, e.g. (0, 1, 2) and (6, 7) are sequences
pairwise = [[]]
for item1, item2 in zip(id_, id_[1:]):
    if item2 - item1 == 1:
        if not pairwise[-1]:
            pairwise[-1].extend((item1, item2))
        else:
            pairwise[-1].append(item2)
    elif pairwise[-1]:
        pairwise.append([])
print(pairwise)
[[0, 1, 2], [6, 7]]
# with the code above we've just iterated over the id_ list
# and created another list with the sequences nested inside;
# now using a list comprehension we can achieve the desired output,
# with tuples nested inside a list
points_above_threshold = [(len(i), (i[0], i[-1])) for i in pairwise]
print(points_above_threshold)
[(3, (0, 2)), (2, (6, 7))]
Hope this is helpful!
I've been studying Python algorithms and would like to solve the following problem:
A positive integer array A and an integer K are given.
Find the largest even sum of the array A with K elements.
If not possible, return -1.
For example, if there is an array A= [1,2,3,4,4,5] and K= 3,
the answer is 12 (5+4+3),
which is the largest even sum with K (3) elements.
However, if A= [3, 3, 3] and K= 1,
the answer is -1 because it cannot make an even sum with one element.
I tried excluding the minimum odd elements from the array, but it failed when K = n in the while loop.
Is there any simple way to solve this problem? I would sincerely appreciate if you could give some advice.
Sort the array and "take" the biggest K elements.
If their sum is already even, you are done.
Otherwise, you need to replace exactly one element: swap an even element you have chosen for an odd one you have not, or the other way around, and you want the difference between the two swapped elements to be minimized.
A naive solution would check all possible swaps, but that's O(n²). You can do better by checking only the two viable candidates:
The maximal odd element you did not choose, and the minimal even element you have chosen.
The maximal even element you did not choose, and the minimal odd element you have chosen.
Choose the swap for which the difference between the two elements is minimal. If no such two elements exist (i.e. your K=1, [3,3,3] example), there is no viable solution.
Time complexity is O(n log n) for the sorting.
In my (very rusty) python, it should be something like:
def FindMaximalEvenArray(a, k):
    a = sorted(a)
    chosen = a[len(a) - k:]
    not_chosen = a[0:len(a) - k]
    if sum(chosen) % 2 == 0:
        return sum(chosen)
    smallest_chosen_even = next((x for x in chosen if x % 2 == 0), None)
    biggest_not_chosen_odd = next((x for x in not_chosen[::-1] if x % 2 != 0), None)
    candidate1 = (smallest_chosen_even - biggest_not_chosen_odd
                  if smallest_chosen_even is not None and biggest_not_chosen_odd is not None
                  else float("inf"))
    smallest_chosen_odd = next((x for x in chosen if x % 2 != 0), None)
    biggest_not_chosen_even = next((x for x in not_chosen[::-1] if x % 2 == 0), None)
    candidate2 = (smallest_chosen_odd - biggest_not_chosen_even
                  if smallest_chosen_odd is not None and biggest_not_chosen_even is not None
                  else float("inf"))
    if candidate1 == float("inf") and candidate2 == float("inf"):
        return -1
    return sum(chosen) - min(candidate1, candidate2)
Note: This can be done even better in terms of time complexity, because you don't actually care about the order of all the elements, only about finding the top K elements and the "candidates". So you could use a selection algorithm instead of sorting, which would make this run in O(n) time.
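For what it's worth, a rough sketch of that selection-flavoured variant (find_maximal_even_sum is a hypothetical name; note heapq.nlargest is O(n log K), so a true linear-time selection would still be needed for strict O(n)):

from collections import Counter
import heapq

def find_maximal_even_sum(a, k):
    chosen = heapq.nlargest(k, a)              # the top K elements
    remaining = Counter(a) - Counter(chosen)   # multiset of the elements left out
    total = sum(chosen)
    if total % 2 == 0:
        return total
    best = float("inf")
    # try both swaps: chosen even <-> unchosen odd, chosen odd <-> unchosen even
    for parity in (0, 1):
        inside = [x for x in chosen if x % 2 == parity]
        outside = [x for x in remaining.elements() if x % 2 != parity]
        if inside and outside:
            best = min(best, min(inside) - max(outside))
    return total - best if best != float("inf") else -1

print(find_maximal_even_sum([1, 2, 3, 4, 4, 5], 3))  # 12
print(find_maximal_even_sum([3, 3, 3], 1))           # -1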
Given a list of lists of tuples, I would like to find the subset of lists which maximize the number of distinct integer values without any integer being repeated.
The list looks something like this:
x = [
[(1,2,3), (8,9,10), (15,16)],
[(2,3), (10,11)],
[(9,10,11), (17,18,19), (20,21,22)],
[(4,5), (11,12,13), (18,19,20)]
]
The internal tuples are always sequential --> (1,2,3) or (15,16), but they may be of any length.
In this case, the expected return would be:
maximized_list = [
[(1, 2, 3), (8, 9, 10), (15, 16)],
[(4, 5), (11, 12, 13), (18, 19, 20)]
]
This is valid because in each case:
Each internal list of x remains intact
There is a maximum number of distinct integers (16 in this case)
No integer is repeated.
If there are multiple valid solutions, all of them should be returned in a list.
I have a naive implementation of this, heavily based on a previous Stack Overflow question I asked, which was not as well formed as it could have been (Python: Find tuples with greatest total distinct values):
import itertools

def maximize(x):
    max_ = 0
    possible_patterns = []
    for i in range(1, len(x) + 1):
        b = itertools.combinations(x, i)
        for combo in b:
            all_ints = tuple(itertools.chain(*itertools.chain(*combo)))
            distinct_ints = tuple(set(all_ints))
            if sorted(all_ints) != sorted(distinct_ints):
                continue
            if len(all_ints) >= max_:
                if len(all_ints) == max_:
                    possible_patterns.append(combo)
                else:
                    possible_patterns = [combo]
                max_ = len(all_ints)
    return possible_patterns
The above function appears to give me the correct result, but it does not scale. I will need to accept x values containing a few thousand lists (possibly as many as tens of thousands), so an optimized algorithm is required.
The following solves for the maximal subset of sublists with respect to cardinality. It works by flattening each sublist into a set, building a list of the pairwise intersections between these sets, and then searching the solution space depth-first for the solution with the most elements (i.e. the largest "weight").
def maximize_distinct(sublists):
    subsets = [{x for tup in sublist for x in tup} for sublist in sublists]
    def intersect(subset):
        return {i for i, sset in enumerate(subsets) if subset & sset}
    intersections = [intersect(subset) for subset in subsets]
    weights = [len(subset) for subset in subsets]
    pool = set(range(len(subsets)))
    max_set, _ = search_max(pool, intersections, weights)
    return [sublists[i] for i in max_set]

def search_max(pool, intersections, weights):
    if not pool:
        return [], 0
    max_set = max_weight = None
    for num in pool:
        next_pool = {x for x in pool - intersections[num] if x > num}
        set_ids, weight = search_max(next_pool, intersections, weights)
        if not max_set or max_weight < weight + weights[num]:
            max_set, max_weight = [num] + set_ids, weight + weights[num]
    return max_set, max_weight
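For example, with the x from the question this should reproduce the expected result (a quick check, not output from the original post):

x = [
    [(1, 2, 3), (8, 9, 10), (15, 16)],
    [(2, 3), (10, 11)],
    [(9, 10, 11), (17, 18, 19), (20, 21, 22)],
    [(4, 5), (11, 12, 13), (18, 19, 20)],
]
print(maximize_distinct(x))
# [[(1, 2, 3), (8, 9, 10), (15, 16)], [(4, 5), (11, 12, 13), (18, 19, 20)]]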
This code can be optimized further by keeping a running total of the "weights" (sum of cardinalities of sublists) discarded, and pruning a branch of the search space when that total exceeds the discard weight of the best solution found so far (which is the minimal discard weight). Unless you run into performance problems, however, this will likely be more work than it's worth: for a small list of lists the overhead of the bookkeeping will exceed the speedup from pruning.
import math

n = 7  # length of list
k = 2  # target sum
arr = [1, 1, 1, 1, 4, 5, 1]
l = n

def segmentedtree(segmentedtreearr, arr, low, high, pos):  # function to build segment tree
    if low == high:
        segmentedtreearr[pos] = arr[high]
        return
    mid = (low + high) // 2
    segmentedtree(segmentedtreearr, arr, low, mid, (2 * pos) + 1)
    segmentedtree(segmentedtreearr, arr, mid + 1, high, (2 * pos) + 2)
    segmentedtreearr[pos] = segmentedtreearr[(2 * pos) + 1] + segmentedtreearr[(2 * pos) + 2]

flag = int(math.ceil(math.log2(n)))  # height of segment tree
size = 2 * int(math.pow(2, flag)) - 1  # size of tree array
segmentedtreearr = [0] * size
low = 0
high = l - 1
pos = 0
segmentedtree(segmentedtreearr, arr, low, high, pos)
if n % 2 == 0:
    print(segmentedtreearr.count(k) + 1)
else:
    print(segmentedtreearr.count(k))
Now arr=[1,1,1,1,4,5,1], so the possible combinations summing to k=2 are [1,1] at indices (0,1), [1,1] at indices (1,2), and [1,1] at indices (2,3); the answer should be 3, but I am getting 2 as the output even though I believe the tree itself is built correctly.
Segment trees are good for looking up ranges when you have an absolute position, but in your case you are searching for a relative measure (a sum).
Your code misses a pair of ones that sit in two different branches of the tree. As you can imagine, larger sums could span several branches (e.g., for sum = 7). There is no trivial way to use this tree to answer the question.
It is much easier with a simple iteration through the list, using two indices (the left and right ends of a range): increment the left index when the running sum is too large, and the right index when it is too small. This assumes that all values in the input list are positive, which is stated in the HackerRank problem you reference:
def count_segments_with_sum(lst, total):
    # `total` acts as the remaining budget of the current window lst[i:j+1]
    i = 0
    count = 0
    for j, v in enumerate(lst):
        total -= v
        while total < 0:  # window sum too large: shrink it from the left
            total += lst[i]
            i += 1
        count += not total  # budget exactly used up -> window sums to target
    return count

print(count_segments_with_sum([1, 1, 1, 1, 4, 5, 1], 2))  # -> 3
Here is an O(n) solution discarding the tree approach. It uses accumulate and groupby from itertools and merge from heapq:
It is not very optimized. My focus was on demonstrating the principle and using vectorizable components.
import itertools as it, operator as op, heapq as hq

arr = [1, 1, 1, 1, 4, 5, 1]
k = 2
N = len(arr)

# compute the cumulative sum (starting at zero) and the same again shifted by -k
ps = list(it.chain(*(it.accumulate(it.chain((i,), arr), op.add) for i in (0, -k))))
# merge the cumsum and the shifted cumsum, indirectly (index based); observe that
# any eligible subsequence will show up as a repeated value in the merge
idx = hq.merge(range(N + 1), range(N + 1, 2 * N + 2), key=ps.__getitem__)
# use groupby to find the repeats
grps = (list(grp) for _, grp in it.groupby(idx, key=ps.__getitem__))
grps = (grp for grp in grps if len(grp) > 1)
grps = [(i, j - N - 1) for i, j in grps]
Result:
[(0, 2), (1, 3), (2, 4)]
Some more detailed explanation:
1) we build the sequence ps = (0, arr_0, arr_0 + arr_1, arr_0 + arr_1 + arr_2, ...) of cumulative sums of arr. This is useful because any sum of a stretch of elements can be written as the difference between two terms of ps.
2) in particular, a contiguous subsequence that sums to k corresponds to a pair of elements of ps whose difference is k. To find those, we make a copy of ps and subtract k from each element; we then need to find the numbers that occur both in ps and in the shifted ps.
3) because ps and the shifted ps are sorted (assuming the terms of arr are positive), the numbers occurring in both can be found in O(n) using merge, which puts such pairs next to each other. If I remember correctly, the merge is guaranteed to be stable, so we can rely on the element from ps coming first in any such pair.
4) it remains to find those pairs, which we do using groupby.
5) but wait: if we do this directly, all we get in the end are pairs of equal values. If you just want to count them, that's fine; but if we want the actual sublists, we have to do the merge indirectly, using the key keyword argument, which works the same way as in sorted.
6) so we create two ranges of indices and use ps.__getitem__ as the key function. Because we have two lists but can only pass one key, we concatenate the lists first; as a consequence, the indices into the first and the second list are distinct.
7) the result is a stream of indices idx such that ps[idx[0]], ps[idx[1]], ... is sorted (ps in the program already has ps − k glued onto it). Using the same key function as before, we can do the groupby indirectly, on idx.
8) we then discard all groups that have only a single element, and for the remaining pairs we shift the second index back.
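As a quick sanity check (reusing arr, k, and grps from the listing above), each returned pair (i, j) marks the slice arr[i:j]:

for i, j in grps:
    print(arr[i:j], sum(arr[i:j]))  # each slice should sum to k == 2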
I am working on a DP solution for a knapsack problem. Given a list of items with their weights and values, I need to find the items with the maximum total value whose total weight is below some predefined limit. Nothing special, just 0-1 knapsack.
I use DP to generate a matrix:
def getKnapsackTable(items, limit):
    matrix = [[0 for w in range(limit + 1)] for j in range(len(items) + 1)]
    for j in range(1, len(items) + 1):
        item, wt, val = items[j - 1]
        for w in range(1, limit + 1):
            if wt > w:
                matrix[j][w] = matrix[j - 1][w]
            else:
                matrix[j][w] = max(matrix[j - 1][w], matrix[j - 1][w - wt] + val)
    return matrix
where items is a list of tuples (name, weight, value). Now, having the DP matrix, the maximum possible value is the number in the bottom-right position. I can also backtrack through the matrix to find the list of items that gives the best solution.
def getItems(matrix, items):
    result = []
    I, j = len(matrix) - 1, len(matrix[0]) - 1
    for i in range(I, 0, -1):
        if matrix[i][j] != matrix[i - 1][j]:
            item, weight, value = items[i - 1]
            result.append(items[i - 1])
            j -= weight
    return result
Great, now I can get the results:
items = [('first', 1, 1), ('second', 3, 8), ('third', 2, 5), ('forth', 1, 1), ('fifth', 1, 2), ('sixth', 5, 9)]
matrix = getKnapsackTable(items, 7)
print(getItems(matrix, items))
and will see: [('fifth', 1, 2), ('third', 2, 5), ('second', 3, 8), ('first', 1, 1)].
The problem is that this is not the only solution: instead of the 'first' element, I could take the 'forth' element (which happens to be identical here, but in general the alternative solutions can differ). I am trying to figure out how to get all the optimal solutions instead of just one. I realize that it will take more time, but I am OK with that.
You can compute the original DP matrix as usual (i.e., using DP), but to find all optimal solutions you need to recurse as you travel back through the matrix from the final state. That's because any given state (i, j) in your matrix has at least one optimal predecessor state, but it might have two: it might be that the maximum value for state (i, j) can be achieved either by choosing to add item i to the optimal solution for state (i-1, j-w(i)), or by leaving item i out and just keeping the optimal solution for (i-1, j). This occurs exactly when these two choices yield equal total values, i.e., when
matrix[i-1][j] == matrix[i-1][j-w(i)]+v(i),
where w(i) and v(i) are the weight and value of object i, respectively. Whenever you detect such a branching, you need to follow each branch.
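A minimal sketch of that recursive traversal, building on the getKnapsackTable and items from the question (get_all_item_sets is a hypothetical helper name, not code from the original post):

def get_all_item_sets(matrix, items, limit):
    # Walk back through the DP matrix, following BOTH predecessors
    # whenever the two choices (skip item i / take item i) tie.
    solutions = []
    def backtrack(i, j, chosen):
        if i == 0:
            solutions.append(list(reversed(chosen)))
            return
        _, wt, val = items[i - 1]
        if matrix[i][j] == matrix[i - 1][j]:  # optimal with item i left out
            backtrack(i - 1, j, chosen)
        if wt <= j and matrix[i][j] == matrix[i - 1][j - wt] + val:  # optimal with item i taken
            backtrack(i - 1, j - wt, chosen + [items[i - 1]])
    backtrack(len(items), limit, [])
    return solutions

matrix = getKnapsackTable(items, 7)
for solution in get_all_item_sets(matrix, items, 7):
    print(solution)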
Note that there could be an extremely large number of optimal solutions: e.g., consider the case when all items have weight 1. In this case, all (n choose w) solutions are optimal.