I have two lists in python and I want to know if they intersect at the same index. Is there a mathematical way of solving this?
For example if I have [9,8,7,6,5] and [3,4,5,6,7] I'd like a simple and efficient formula/algorithm that finds that at index 3 they intersect. I know I could do a search just wondering if there is a better way.
I know there is a formula to solve two lines in y = mx + b form by subtracting them from each other, but my "line" isn't truly a line because it's limited to the items in the list, and it may have curves.
Any help is appreciated.
You could take the set-theoretic intersection of the coordinates in both lists:
intersecting_points = set(enumerate(list1)).intersection(set(enumerate(list2)))
...enumerate gives you an iterable of tuples of indexes and values - in other words, (0,9),(1,8),(2,7),etc.
http://docs.python.org/library/stdtypes.html#set-types-set-frozenset
...make sense? Of course, that won't truly give you geometric intersection - for example, [1,2] intersects with [2,1] at [x=0.5,y=1.5] - if that's what you want, then you have to solve the linear equations at each interval.
from itertools import izip

def find_intersection(lineA, lineB):
    for pos, (A0, B0, A1, B1) in enumerate(izip(lineA, lineB, lineA[1:], lineB[1:])):
        # check integer intersections
        if A0 == B0:  # check required if the intersection is at position 0
            return pos
        if A1 == B1:  # check required if the intersection is at the last position
            return pos + 1
        # check for intersection between points
        if (A0 > B0 and A1 < B1) or (A0 < B0 and A1 > B1):
            # intersection between pos and pos+1!
            return pos + solve_linear_equation(A0, A1, B0, B1)
    # no intersection
    return None
...where solve_linear_equation finds the intersection between segments (0,A0)→(1,A1) and (0,B0)→(1,B1).
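The answer leaves solve_linear_equation undefined; here is one possible sketch of it (my own, under the convention stated above that both segments run over the unit interval between pos and pos+1):

def solve_linear_equation(A0, A1, B0, B1):
    # Segments: y_A(t) = A0 + t*(A1 - A0) and y_B(t) = B0 + t*(B1 - B0), with t in [0, 1].
    # Setting them equal and solving for t gives the fractional offset of the crossing.
    # The caller only reaches this when the segments strictly cross, so the denominator is nonzero.
    return float(B0 - A0) / ((A1 - A0) - (B1 - B0))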
I assume one dimension in your list is implicit, e.g. [9,8,7,6,5] are heights at x1,x2,x3,x4,x5, right? In that case, how would your list represent a curve like y=0?
In any case I don't think there is any shortcut for calculating the intersection of generic or random curves; the best solution is to do an efficient search.
import itertools

def intersect_at_same_index(seq1, seq2):
    return (
        idx
        for idx, (item1, item2)
        in enumerate(itertools.izip(seq1, seq2))
        if item1 == item2).next()
This will return the index where the two sequences have equal items, and raise a StopIteration if all item pairs are different. If you don't like this behaviour, enclose the return statement in a try statement, and at the except StopIteration clause return your favourite failure indicator (e.g. -1, None…)
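For example, a minimal sketch of that try/except variant (my wording, keeping the Python 2 izip and the generator from above):

import itertools

def intersect_at_same_index(seq1, seq2):
    try:
        return (
            idx
            for idx, (item1, item2)
            in enumerate(itertools.izip(seq1, seq2))
            if item1 == item2).next()
    except StopIteration:
        return None  # or -1, or whatever failure indicator you prefer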
Let's say I have 2 strings which are pretty similar. I want to find another string which is close to both s1 and s2 in terms of Levenshtein distance.
import Levenshtein
s1 = 'aaabbbccc'
s2 = 'abacbbbccde'
dist = Levenshtein.distance(s1, s2)
print(dist)
mid_str = get_avg_string(s1, s2)
What would be an effective implementation of the function:
def get_avg_string(s1, s2):
    return ''
I need these variables:
sum_lev = Levenshtein.distance(s1, mid_str) + Levenshtein.distance(s2, mid_str)
diff_lev = abs(Levenshtein.distance(s1, mid_str) - Levenshtein.distance(s2, mid_str))
to be minimal (I think sum_lev will be equal to dist and diff_lev <= 1).
I'm afraid that what you ask for is not possible, since the problem is NP-hard. I will try to outline a few of the key concepts for why that is the case, but I'd encourage you to look up Center Strings and Steiner Strings.
Suppose that you have a set of strings called S. The optimal Steiner String is the string that minimizes the sum of its distances to all the strings in S (also known as the consensus error). This corresponds to the first property, which you called sum_lev. The optimal Steiner String is usually ambiguous and not part of the original set S (but it doesn't have to be).
The problem you are facing is that there is no efficient way to compute the optimal Steiner String. Even if you manage to restrict your search space, you will still have an exponential number of candidates. The problem is hence NP-hard.
It can be proven that S always contains a string which is a reasonable approximation of the optimal Steiner String. So even if you only pay attention to the first of your two properties, the best shot you have is to simply select one of your original strings. Since you are apparently only dealing with two strings, it should not matter which one you choose.
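A minimal sketch of that fallback, assuming the python-Levenshtein package used in the question: pick whichever member of S has the smallest total distance to the others (with only two strings, either one works).

import Levenshtein

def approx_steiner_string(strings):
    # Return the member of `strings` with the smallest consensus error,
    # i.e. the smallest sum of Levenshtein distances to all members.
    return min(strings,
               key=lambda s: sum(Levenshtein.distance(s, t) for t in strings))

print(approx_steiner_string(['aaabbbccc', 'abacbbbccde']))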
TL;DR
To summarize, you are dealing with an NP-hard problem which can not be solved efficiently, but only approximated. If you are only dealing with two strings, the string you are looking for can be approximated by one of the given strings. I'm sorry that this is probably not the answer you hoped for, but hopefully it was still somewhat helpful.
Assumption
So first of all, let's assume string1 (let's call it s1) is the "before" and string2 (let's call it s2) the "after" of the transformation. This way we can easily separate add and remove character operations.
The example
Let's consider the example you gave.
Levenshtein.distance(s1='aaabbbccc', s2='abacbbbccde')
This means we are asking how many operations separate these strings (how much it costs to mutate one into the other).
Levenshtein matrix
Now that we assume s1 is the start point, let's see what the algorithm gives us.
We can calculate the distance between s1 and s2, and it comes out to the integer value 4.
It is read off the last (bottom-right) value of the calculated Levenshtein matrix.
Walking the Levenshtein matrix
As we can see, there are places where the value goes up and places where it stays the same.
If we go over the matrix from the top-left corner, we should read it like this:
going down means: adding a character to the s1 string
going right means: removing a character from the s1 string
going diagonally down-right means: replacing the character
Our goal is to get to the bottom-right corner, and the resulting value is the cost (or distance) associated with that path.
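The answer refers to pictures of the matrix that are not reproduced here; as a stand-in, here is a sketch of the standard dynamic-programming construction (my own illustration; the orientation, s1 across the columns and s2 down the rows, is my assumption so that it matches the reading rules above):

def levenshtein_matrix(s1, s2):
    # columns correspond to prefixes of s1, rows to prefixes of s2
    rows, cols = len(s2) + 1, len(s1) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j                      # going right: removing characters from s1
    for i in range(rows):
        d[i][0] = i                      # going down: adding characters to s1
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s2[i - 1] == s1[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # down: addition
                          d[i][j - 1] + 1,           # right: removal
                          d[i - 1][j - 1] + cost)    # diagonal: keep or replace
    return d

matrix = levenshtein_matrix('aaabbbccc', 'abacbbbccde')
print(matrix[-1][-1])  # prints 4, the distance in the bottom-right corner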
Distance change in matrix
Let's see how the matrix changes its values when we change the last character of s1.
As we can see, the previous cxd intersection changed to dxd, the cost no longer rises at that place, and that results in a smaller distance between the strings.
What we see is that a small change in s1 results in a distance change of 1, and if we compare the original s1 to the modified one, it looks pretty close in terms of Levenshtein distance.
Conclusion
There is potentially an algorithm here to generate quite a lot of strings similar to s1 and s2.
It would go over the generated matrix and change a single character in the string to generate the next solution.
We should consider multiple changes made to the original matrix. And then, for each new solution, we would want to calculate a new Levenshtein matrix and use it as the next source of solutions.
We should also not restrict ourselves to lowering these values; that would generate only a portion of the potential solutions.
One important thing to keep in mind: in terms of Levenshtein distance it does not matter whether we compare character a to b or a to c. All that matters is "are these the same character?"; if they are not, we do not care about the actual value.
A little expansion of #Matze's answer: when you have an NP-hard problem to solve, you could use a genetic algorithm to find, in finite time, a solution that might be better than just taking one of the strings (with no guarantee that it would be the best, or even better than one of the original strings).
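A minimal sketch of that idea, assuming the python-Levenshtein package; the fitness function, population size and mutation scheme are all arbitrary choices of mine, not something prescribed by the answer above:

import random
import Levenshtein

def fitness(cand, s1, s2):
    # smaller is better: penalize both the total distance and the imbalance
    d1 = Levenshtein.distance(s1, cand)
    d2 = Levenshtein.distance(s2, cand)
    return d1 + d2 + abs(d1 - d2)

def mutate(cand, alphabet):
    # replace, insert or delete one random character
    i = random.randrange(max(len(cand), 1))
    op = random.choice(('replace', 'insert', 'delete'))
    if op == 'replace' and cand:
        return cand[:i] + random.choice(alphabet) + cand[i + 1:]
    if op == 'insert':
        return cand[:i] + random.choice(alphabet) + cand[i:]
    return cand[:i] + cand[i + 1:] if cand else cand

def genetic_mid_string(s1, s2, generations=200, pop_size=30, keep=10):
    alphabet = sorted(set(s1 + s2))
    population = [s1, s2]
    for _ in range(generations):
        # breed new candidates by mutating the current survivors
        population += [mutate(random.choice(population), alphabet)
                       for _ in range(pop_size - len(population))]
        # keep only the fittest candidates for the next generation
        population = sorted(set(population),
                            key=lambda c: fitness(c, s1, s2))[:keep]
    return population[0]

print(genetic_mid_string('aaabbbccc', 'abacbbbccde'))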
This is not a solution but something to start with. An immediate improvement would be to the function eq_len_strings, which pads the shorter string on the right; you could also take sub-strings of s1 that occur in s2 to create your first mid-string (since this helps reduce the Levenshtein distance) and then just fill the _ placeholders with any character of your alphabet, as search_mid_string does.
Another improvement would be to avoid (contrary to what I do) filling in all the blanks (_), perhaps by adding the empty string to your alphabet or by taking the difference in length between the two strings into account.
import Levenshtein

def eq_len_strings(s1, s2):
    if len(s1) < len(s2):
        s1 = s1.ljust(len(s1) + len(s2) - len(s1), '_')
    elif len(s1) > len(s2):
        s2 = s2.ljust(len(s2) + len(s1) - len(s2), '_')
    return s1, s2

def keep_eq_chars(s1, s2):
    s = ''
    for char1, char2 in zip(s1, s2):
        if char1 == '_' or char2 == '_' or char1 != char2:
            s += '_'
        else:
            s += char1
    return s

def search_mid_string(s1, s2):
    alph = set(list(s1 + s2))
    s1e, s2e = eq_len_strings(s1, s2)
    # start string
    s_mid = list(keep_eq_chars(s1e, s2e))
    alternate = 0
    for i in range(len(s_mid)):
        char1 = s1[i] if i < len(s1) - 1 else '_'
        char2 = s2[i] if i < len(s2) - 1 else '_'
        if s_mid[i] == '_':
            if alternate == 0 and char1 != '_':
                s_mid[i] = char1
                alternate = 1
            elif alternate == 1 and char2 != '_':
                s_mid[i] = char2
                alternate = 0
            else:
                s_mid[i] = ''
    s1_to_s2_dist = Levenshtein.distance(s1, s2)
    s1_to_mid_dist = Levenshtein.distance(s1, ''.join(s_mid))
    s2_to_mid_dist = Levenshtein.distance(s2, ''.join(s_mid))
    ans = {
        's1_to_s2_dist': s1_to_s2_dist,
        's1_to_mid_dist': s1_to_mid_dist,
        's2_to_mid_dist': s2_to_mid_dist,
        's_mid': ''.join(s_mid)
    }
    return ans
With the strings given:
s1 = 'aaabbbccc'
s2 = 'abacbbbccde'
search_mid_string(s1, s2)
Output:
{'s1_to_s2_dist': 4,
 's1_to_mid_dist': 2,
 's2_to_mid_dist': 3,
 's_mid': 'aaacbbcccd'}
This is not a solution, but a very basic idea to finding a string that is close to two strings:
import Levenshtein

s1 = 'aaabbbccc'
s2 = 'abacbbbccde'

dist = Levenshtein.distance(s1, s2)
print(dist)

def get_avg_string(s1, s2):
    s2, s1 = sorted([s1, s2], key=len)
    s3 = ''.join(a if i & 1 else b for i, (a, b) in enumerate(zip(s1, s2)))
    return s3

mid_str = get_avg_string(s1, s2)
print(Levenshtein.distance(s1, mid_str))
print(Levenshtein.distance(s2, mid_str))
Output:
4
2
3
Explaining the function
def get_avg_string(s1, s2):
    s2, s1 = sorted([s1, s2], key=len)
    s3 = ''.join(a if i & 1 else b for i, (a, b) in enumerate(zip(s1, s2)))
    return s3
The loop
for i, (a, b) in enumerate(zip(s1, s2))
will iterate through strings s1 and s2 in parallel using the zip() function, along with the index of each iteration provided by the enumerate() function.
The condition
a if i & 1 else b
will take the character from s1, a, if the index is odd (i.e. i & 1 returns 1 rather than 0), else the character from s2, b.
I used
s2, s1 = sorted([s1, s2], key=len)
to make the line below it take the first character from the longest string.
Here is code which returns the list of all intermediate strings between s1 and s2, so you can choose the one in the middle.
def get_all_avg_strings(s1, s2):
    import Levenshtein
    dist = Levenshtein.distance(s1, s2)
    s1_new = s1
    intermediate_strings = [s1_new]
    for i in range(dist):
        e = Levenshtein.editops(s1_new, s2)
        s1_new = Levenshtein.apply_edit(e[:1], s1_new, s2)
        intermediate_strings.append(s1_new)
    check = Levenshtein.distance(s2, s1_new)
    if check != 0:
        print('Some error here!')
        exit()
    return intermediate_strings[1:-1]
Then you can create the requested function as:
def get_avg_string(s1, s2):
    avg = get_all_avg_strings(s1, s2)
    if len(avg) > 0:
        s = avg[len(avg) // 2]
    else:
        s = s1
    return s
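A quick usage note (my addition): because each intermediate string lies on an optimal edit path, its distances to s1 and s2 should sum to the original distance, so the middle element is about as balanced as that path allows.

import Levenshtein

s1 = 'aaabbbccc'
s2 = 'abacbbbccde'
mid_str = get_avg_string(s1, s2)
# the two distances should sum to Levenshtein.distance(s1, s2), which is 4 here
print(Levenshtein.distance(s1, mid_str), Levenshtein.distance(s2, mid_str))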
Suppose there are two arrays. Every element of each array is a short line segment given by a start position and an end position.
a1 = [[0,1],[3,6],[7,9]]
a2 = [[2,6],[0,1]]
In this example, a1[0] is the same as a2[1], so the overlap length is 1. a1[1] and a2[0] have an overlap length of 3. The total result is 4.
Is there any way to compute this easily?
You can use itertools.product to generate all interval pairs and then calculate the overlap for each pair. Two intervals overlap only when each one starts before the other ends; otherwise the max(0, ...) below clamps the contribution to zero.
import itertools

overlap = 0
for x, y in itertools.product(a1, a2):
    max_start = max(x[0], y[0])
    min_end = min(x[1], y[1])
    overlap += max(0, min_end - max_start)
There is an ambiguity in the problem statement: Can intervals in the same set overlap each other, and if so, do we double count the overlap of those intervals with an interval in the other set or not?
Anyway, the brute-force approach takes O(N^2) time, which may be fine depending on how large the sets are. It can be improved to O(N*logN) by sorting the two sets by their starting points. If overlapping within the same set is not allowed, you can simply go from left to right, keeping track of the last intervals in each set that overlap each other. If overlapping within the same set is allowed, you can keep a heap of intervals from the first set whose endpoints have not been reached yet, and iterate over the second set.
In the case of non-overlapping intervals within the same set, the code will be something like this:
a1 = [[0,1],[3,6],[7,9]]
a2 = [[2,6],[0,1]]

a1.sort(key=lambda x: x[0])
a2.sort(key=lambda x: x[0])

i1 = 0
i2 = 0
overlapping = 0
while i1 < len(a1) and i2 < len(a2):
    # start and end of the overlapping part
    start = max(a1[i1][0], a2[i2][0])
    end = min(a1[i1][1], a2[i2][1])
    overlapping += max(0, end - start)
    # move the interval that ends first to the next interval in the same set
    if a1[i1][1] < a2[i2][1]:
        i1 += 1
    else:
        i2 += 1
print(overlapping)
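For the case where intervals within the same set may overlap (and every overlapping pair should be counted), here is a sketch of an O(N*logN) alternative to the heap described above: a sweep line that, on every elementary segment, multiplies the number of active intervals from each set. This is my own illustration, not part of the original answer.

def total_pairwise_overlap(a1, a2):
    # Build sweep events: +1/-1 deltas for the active-interval count of each set.
    events = []
    for start, end in a1:
        events.append((start, 1, 0))
        events.append((end, -1, 0))
    for start, end in a2:
        events.append((start, 0, 1))
        events.append((end, 0, -1))
    events.sort()

    total = 0
    active1 = active2 = 0
    prev_x = None
    for x, d1, d2 in events:
        if prev_x is not None:
            # every active interval of a1 overlaps every active interval of a2
            # on the segment [prev_x, x]
            total += (x - prev_x) * active1 * active2
        active1 += d1
        active2 += d2
        prev_x = x
    return total

print(total_pairwise_overlap([[0, 1], [3, 6], [7, 9]], [[2, 6], [0, 1]]))  # 4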
Given a sequence of distinct items Sa, we wish to create a sequence Sb (composed of the same items in Sa, but in a different order) such that the sequence S = Sa + Sb (sequence Sb appended immediately after sequence Sa) satisfies the following properties:
The distance (number of positions) between the two occurrences of item I in S is at least some number T for all items I.
If items I and J are within N positions in Sa, then I and J are not within N positions in Sb.
I've been able to program the first stipulation in Python fairly simply. However, the second one is where I struggle. Essentially, I'm just wanting these two things:
I want the second sequence to have its items "far away enough" from their occurrence in the first sequence.
I don't want neighbors of the first sequence to also be neighbors in the second sequence (with N referring to the distance in which items are considered neighbors).
Here's what I have so far:
import random

clips = list(range(10))  # arbitrary items
choice_pool = clips[:]
Sa = clips[:]
random.shuffle(Sa)
Sb = []

count = len(Sa)
threshold = 0.5 * len(clips)  # the minimum distance an item has to be away from itself in the concatenated sequence

while len(Sb) != len(Sa):
    jj = random.randint(0, len(choice_pool) - 1)
    # we want clip a1 to be at least threshold away from clip b1
    if count - Sa.index(choice_pool[jj]) >= threshold:
        Sb.append(choice_pool[jj])
        del choice_pool[jj]
        count += 1

print("Sa:", Sa)
print("Sb:", Sb)
print("S :", Sa + Sb)
Do you have any advice on how to also achieve the second stipulation, while always guaranteeing such a sequence exists (not ending up in an infinite loop)? Thanks.
I would take the randomness out of the equation. With randomness you are never guaranteed that you won't get stuck in an infinite loop. There are better improvements to this algorithm, but here is the base.
import itertools as it

def filter_criteria(sequence):
    # put your filters here; return True if you find a sequence that works
    pass

def find_sb(sa):  # wrapped in a function so the return statements are valid
    for sb_try in it.permutations(sa, len(sa)):
        if filter_criteria(sa + list(sb_try)):  # permutations yields tuples, so convert before concatenating
            return sb_try
    return "no permutation matches"
The Problem:
You are given an array m of size n, where each value of m is composed of a weight w, and a percentage p.
m = [m0, m1, m2, ... , mn] = [[m0w, m0p], [m1w, m1p], [m2w, m2p], ..., [mnw, mnp] ]
So we'll represent this in python as a list of lists.
We are then trying to find the minimum value of this function:
def minimize_me(m):
    t = 0
    w = 1
    for i in range(len(m)):
        current = m[i]
        t += w * current[0]
        w *= current[1]
    return t
where the only thing we can change about m is its ordering (i.e. we may rearrange the elements of m in any way). Additionally, this needs to complete in better than O(n!) time.
Brute Force Solution:
import itertools
import sys

min_t = sys.maxint
min_permutation = None
for permutation in itertools.permutations(m):
    t = minimize_me(list(permutation))
    if t < min_t:
        min_t = t
        min_permutation = list(permutation)
Ideas On How To Optimize:
the idea:
Instead of finding the best order, see if we can find a way to compare two given values in m when we know the state of the problem (the code might explain this more clearly). If I can build this using a bottom-up approach (so, starting from the end, assuming I have no optimal solution) and I can create an equation that compares two values of m and says one is definitively better than the other, then I can construct an optimal solution by using that new value and comparing it against the next set of values of m.
the code:
import itertools
import sys

def compare_m(a, b, v):
    a_first = b[0] + b[1] * (a[0] + a[1] * v)
    b_first = a[0] + a[1] * (b[0] + b[1] * v)
    if a_first > b_first:
        return a, a_first
    else:
        return b, b_first

best_ordering = []
v = 0
while len(m) > 1:
    best_pair_t = sys.maxint
    best_m = None
    for pair in itertools.combinations(m, 2):
        winner, pair_t = compare_m(pair[0], pair[1], v)  # renamed from m to avoid shadowing the list
        if pair_t < best_pair_t:
            best_pair_t = pair_t
            best_m = winner
    best_ordering.append(best_m)
    m.remove(best_m)
    v = best_m[0] + best_m[1] * v

first = m[0]
best_ordering.append(first)
However, this is not working as intended. The first value is always right, and roughly 60-75% of the time, the entire solution is optimal. However, in some cases, it looks like the way I am changing the value v which then gets passed back into my compare is evaluating much higher than it should. Here's the script I'm using to test against:
import random

m = []
for i in range(0, 5):
    w = random.randint(1, 1023)
    p = random.uniform(0.01, 0.99)
    m.append([w, p])
Here's a particular test case demonstrating the error:
m = [[493, 0.7181996086105675], [971, 0.19915848527349228], [736, 0.5184210526315789], [591, 0.5904761904761905], [467, 0.6161290322580645]]
optimal solution (just the indices) = [1, 4, 3, 2, 0]
my solution (just the indices) = [4, 3, 1, 2, 0]
It feels very close, but I cannot for the life of me figure out what is wrong. Am I looking at this the wrong way? Does this seem like it's on the right track? Any help or feedback would be greatly appreciated!
We don't need any information about the current state of the algorithm to decide which elements of m are better. We can just sort the values using the following key:
def key(x):
    w, p = x
    return w / (1 - p)

m.sort(key=key)
This requires explanation.
Suppose (w1, p1) is directly before (w2, p2) in the array. Then after processing these two items, t will be increased by an increment of w * (w1 + p1*w2) and w will be multiplied by a factor of p1*p2. If we switch the order of these items, t will be increased by an increment of w * (w2 + p2*w1) and w will be multiplied by a factor of p1*p2. Clearly, we should perform the switch if (w1 + p1*w2) > (w2 + p2*w1), or equivalently after a little algebra, if w1/(1-p1) > w2/(1-p2). If w1/(1-p1) <= w2/(1-p2), we can say that these two elements of m are "correctly" ordered.
In the optimal ordering of m, there will be no pair of adjacent items worth switching; for any adjacent pair of (w1, p1) and (w2, p2), we will have w1/(1-p1) <= w2/(1-p2). Since the relation of having w1/(1-p1) <= w2/(1-p2) is the natural total ordering on the w/(1-p) values, the fact that w1/(1-p1) <= w2/(1-p2) holds for any pair of adjacent items means that the list is sorted by the w/(1-p) values.
Your attempted solution fails because it only considers what a pair of elements would do to the value of the tail of the array. It doesn't consider the fact that rather than using a low-p element now, to minimize the value of the tail, it might be better to save it for later, so you can apply that multiplier to more elements of m.
Note that the proof of our algorithm's validity relies on all p values being at least 0 and strictly less than 1. If p is 1, we can't divide by 1-p, and if p is greater than 1, dividing by 1-p reverses the direction of the inequality. These problems can be resolved using a comparator or a more sophisticated sort key. If p is less than 0, then w can switch sign, which reverses the logic of what items should be switched. Then we do need to know about the current state of the algorithm to decide which elements are better, and I'm not sure what to do then.
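As a quick sanity check (my own sketch, not part of the answer), you can compare the sort-key ordering against the brute-force optimum on small random instances; with p values in [0, 1) they should always agree:

import itertools
import random

def minimize_me(m):
    # same objective as in the question
    t, w = 0, 1
    for weight, pct in m:
        t += w * weight
        w *= pct
    return t

for _ in range(100):
    m = [[random.randint(1, 1023), random.uniform(0.01, 0.99)] for _ in range(6)]
    brute = min(itertools.permutations(m), key=minimize_me)
    greedy = sorted(m, key=lambda x: x[0] / (1 - x[1]))
    assert abs(minimize_me(list(brute)) - minimize_me(greedy)) < 1e-6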
I am having a hard time modifying this code. I'm really new to Python and I am trying to find the closest pair among 10 integers input by a user. Here's my code so far, and there is a syntax error...
a = [0,1,2,3,4,5,6,7,8,9]
a[0]=input()
a[1]=input()
a[2]=input()
a[3]=input()
a[4]=input()
a[5]=input()
a[6]=input()
a[7]=input()
a[8]=input()
a[9]=input()
a.sort()
b = sorted(set(a))
for item in enumerate(a):
    for item1 in enumerate(b):
        c = item - enumerate(b)
        if c = item-1:
            print item
            print c
Thanks,
Ai
Your code is causing exceptions because you're not handling the output of enumerate properly. Your item values are going to be (index, value) pairs, not single values, so there's no way to subtract them directly.
Here's another implementation, which may be something like what you were aiming for:
import itertools

def find_nearest_pair(lst):
    min_pair = None
    min_distance = float("inf")
    for a, b in itertools.combinations(lst, 2):  # generates all (a, b) pairs
        distance = abs(a - b)  # abs makes distance always non-negative
        if distance < min_distance:
            min_pair = (a, b)
            min_distance = distance
    return min_pair  # you could return min_distance here too (or instead)
You could even compress it down further using the min function:
nearest_pair = min(itertools.combinations(lst, 2),
                   key=lambda item: abs(item[0] - item[1]))
Or if you just want the value:
nearest_pair_distance = min(abs(a-b) for a, b in itertools.combinations(lst, 2))
What are all these calls to enumerate for? That's for when you want to iterate through a collection and also keep a counter. You should remove all of them - but especially the one in the line c = item - enumerate(b), which makes absolutely no sense.
Once you've got it running, you should see you have a number of logic errors too, but I'll leave you to fix those yourself.