Create a Sequence With Two of Each Element Satisfying Certain Distance Criteria - python

Given a sequence of distinct items Sa, we wish to create a sequence Sb (composed of the same items in Sa, but in a different order) such that the sequence S = Sa + Sb (sequence Sb appended immediately after sequence Sa) satisfies the following properties:
The distance (number of positions) between the two occurrences of item I in S is at least some number T for all items I.
If items I and J are within N positions in Sa, then I and J are not within N positions in Sb.
I've been able to program the first stipulation in Python fairly simply. However, the second one is where I struggle. Essentially, I want these two things:
I want the second sequence to have its items "far away enough" from their occurrence in the first sequence.
I don't want neighbors of the first sequence to also be neighbors in the second sequence (with N referring to the distance in which items are considered neighbors).
Here's what I have so far:
import random

clips = list(range(10)) # arbitrary items
choice_pool = clips[:]
Sa = clips[:]
random.shuffle(Sa)
Sb = []
count = len(Sa)
threshold = 0.5*len(clips) # the minimum distance the item has to be away from itself in the concatenated sequence

while len(Sb) != len(Sa):
    jj = random.randint(0, len(choice_pool) - 1)
    # we want clip a1 to be at least threshold away from clip b1
    if count - Sa.index(choice_pool[jj]) >= threshold:
        Sb.append(choice_pool[jj])
        del choice_pool[jj]
        count += 1

print("Sa:", Sa)
print("Sb:", Sb)
print("S :", Sa + Sb)
Do you have any advice on how to also achieve the second stipulation, while always guaranteeing such a sequence exists (not ending up in an infinite loop)? Thanks.

I would take randomness out of the equation. With randomness you are never guaranteed to avoid an infinite loop. There are better improvements to this algorithm, but here is the base:
import itertools as it

def filter_criteria(sequence):
    # put your filters here; return True if you find a sequence that works
    pass

def find_sb(sa):
    for sb_try in it.permutations(sa, len(sa)):
        if filter_criteria(sa + list(sb_try)):
            return sb_try
    return "no permutation matches"

Related

How do I fix code that calculates the amount of combinations in the partitions of a set?

I am working on code in Python 2 that partitions a set of 13 elements using integer partitions, then evaluates the different combinations each partition can have (order does not matter). I have seen the ways people do this by using recursive functions to calculate every partition of a set, but for what I'm working on I'm taking a different approach.
I'm working with the logic that the ways a set can be partitioned are determined by the integer partitions of its size. For a set of 4 elements, it can be partitioned in these ways:
[1,1,1,1]
[1,1,2]
[2,2]
[1,3]
[4]
Every number stands for the length of a subset in the partition. Using this info, I can then calculate all of the combinations that can be used with these different integer partitions. If I add the number of combinations from each partition together, I should receive the Bell number (the number of possible partitions in a set). For a list of 4 elements, the Bell number should be 15.
My code runs through the subset lengths in each partition, sets the length of the set to n and the subset length to r, then calculates the combinations in the specific subset. When it goes to the next subset, it subtracts the previous r from n to account for it lessening the amount of combinations available, as n gets smaller when a subset is already defined.
My code, however, is lackluster. When inputting 4 as the length of the set, it outputs 16 (instead of 15). When inputting 5, it outputs 48 (instead of 52). When inputting 13, it outputs 102,513 (instead of 27,644,437). I need it to be exact rather than an estimate.
This is in part because of if elem != 1: not properly accounting for a list of all ones or a list of one subset. It's also in part because it doesn't account for repeats of a combination when appearing in a subset. In [2,2] for a list of 4 elements, it considers the subset to contain 6 combinations when in reality it contains 3.
I'm stuck on how to solve this issue, as I only know enough Python to get by. The way the code currently outputs is how I prefer it to output, obviously without the errors.
The recursive function that calculates the integer partitions is from Nicolas Blanc, and the rest was coded by myself. Important links: Bell number, Partition of a set
import math

in_par = []
stack = []
bell = 0

def partitions(remainder, start_number = 1):
    if remainder == 0:
        in_par.append(list(stack))
        #print stack
    else:
        for nb_to_add in range(start_number, remainder+1):
            stack.append(nb_to_add)
            partitions(remainder - nb_to_add, nb_to_add)
            stack.pop()

x = partitions(13) # <------- input element count here

for part in in_par:
    part.reverse()
    combinations = 0
    n = 13 # <------- input element count here
    for i, elem in enumerate(part):
        r = elem
        combo = 0
        if elem != 1:
            if i != (len(part) - 1):
                combo = math.factorial(n) / (math.factorial(r) * math.factorial(n-r))
        n = n - elem
        combinations = combinations + combo
    bell = bell + combinations
    part.append([combinations])
    print part
#print str(bell)
print "Bell Number: " + str(bell)

Fastest way to sample most numbers with minimum difference larger than a value from a Python list

Given a list of 20 floats, I want to find a largest subset in which any two of the chosen numbers differ from each other by more than mindiff = 1. Right now I am using a brute-force method that searches from the largest to the smallest subsets using itertools.combinations. As shown below, the code finds a subset after about 4 s for a list of 20 numbers.
from itertools import combinations
import random
from time import time

mindiff = 1.
length = 20
random.seed(99)
lst = [random.uniform(1., 10.) for _ in range(length)]

t0 = time()
n = len(lst)
sample = []
found = False
while not found:
    # get all subsets with size n
    subsets = list(combinations(lst, n))
    # shuffle to ensure randomness
    random.shuffle(subsets)
    for subset in subsets:
        # sort the subset numbers
        ss = sorted(subset)
        # calculate the differences between every two adjacent numbers
        diffs = [j-i for i, j in zip(ss[:-1], ss[1:])]
        if min(diffs) > mindiff:
            sample = set(subset)
            found = True
            break
    # check subsets with size -1
    n -= 1

print(sample)
print(time()-t0)
Output:
{2.3704888087015568, 4.365818049020534, 5.403474619948962, 6.518944556233767, 7.8388969285727015, 9.117993839791751}
4.182451486587524
However, in reality I have a list of 200 numbers, which is infeasible for brute-force enumeration. I want a fast algorithm to sample just one random largest subset with a minimum difference larger than 1. Note that I want each sample to be random and of maximum size. Any suggestions?
My previous answer assumed you simply wanted a single optimal solution, not a uniform random sample of all solutions. This answer assumes you want one that samples uniformly from all such optimal solutions.
1. Construct a directed acyclic graph G where there is one node for each point, and nodes a and b are connected when b - a > mindist. Also add two virtual nodes, s and t, where s -> x for all x and x -> t for all x.
2. Calculate for each node in G how many paths of length k exist to t. You can do this efficiently in O(n^2 k) time using dynamic programming with a table P[x][k], filling initially P[x][0] = 0 except P[t][0] = 1, and then P[x][k] = sum(P[y][k-1] for y in neighbors(x)). Keep doing this until you reach the maximum k - you now know the size of the optimal subset.
3. Uniformly sample a path of length k from s to t using P to weight your choices. This is done by starting at s. We then look at each neighbor of s and choose one randomly with a weighting dictated by P[s][k]. This gives us our first element of the optimal set. We then repeatedly perform this step: we are at x, look at the neighbors of x and pick one randomly using weights P[x][k-i], where i is the step we're at.
4. Use the nodes you sampled in step 3 as your random subset.
An implementation of the above in pure Python:
import random

def sample_mindist_subset(xs, mindist):
    # Construct directed graph G.
    n = len(xs)
    s = n; t = n + 1 # Two virtual nodes, source and sink.
    neighbors = {
        i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
        for i in range(n)}
    neighbors[s] = [t] + list(range(n))
    neighbors[t] = []

    # Compute number of paths P[x][k] from x to t of length k.
    P = [[0 for _ in range(n+2)] for _ in range(n+2)]
    P[t][0] = 1
    for k in range(1, n+2):
        for x in range(n+2):
            P[x][k] = sum(P[y][k-1] for y in neighbors[x])

    # Sample maximum length path uniformly at random.
    maxk = max(k for k in range(n+2) if P[s][k] > 0)
    path = [s]
    while path[-1] != t:
        candidates = neighbors[path[-1]]
        weights = [P[cn][maxk-len(path)] for cn in candidates]
        path.append(random.choices(candidates, weights)[0])

    return [xs[i] for i in path[1:-1]]
Note that if you want to sample from the same set of numbers many times, you don't have to recompute P every single time and can re-use it.
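For instance (an added usage sketch, not from the original answer), on the question's own setup it could be called like this:

# Hypothetical usage on the question's setup (seed 99, 20 uniform floats).
# Repeated calls within a run return different maximum-size subsets.
import random

random.seed(99)
lst = [random.uniform(1., 10.) for _ in range(20)]
print(sample_mindist_subset(lst, 1.))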
I probably don't fully understand the question, because right now the solution is quite trivial. EDIT: yes, I misunderstood after all; the OP does not just want an optimal solution, but wishes to randomly sample from the set of optimal solutions. This answer is not incorrect, but it is an answer to a different question than the one the OP is interested in.
Simply sort the numbers and greedily construct the subset:
def mindist_subset(xs, mindist):
    result = []
    for x in sorted(xs):
        if not result or x - result[-1] > mindist:
            result.append(x)
    return result
Sketch of proof of correctness.
Suppose we have a solution S, given input array A, that is of optimal size. If it does not contain min(A), note that we could remove min(S) from S and add min(A), since this would only increase the distance between the smallest element and the second smallest number in S. Conclusion: we can assume without loss of generality that min(A) is part of an optimal solution.
Now we can apply this argument recursively. We add min(A) to the solution and remove all elements too close to min(A), leaving the remaining elements A'. We are then left with a subproblem where exactly the same argument applies: we can choose min(A') as the next element of the solution, and so on.
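As a concrete check (an added illustration, not part of the original answer): by the argument above the greedy subset has maximum size, so on the question's seed-99 list it should contain 6 elements, the same size the brute-force search found.

# Hypothetical check against the question's setup: the greedy subset should
# have the maximum possible size (6 elements for the seed-99 list).
import random

random.seed(99)
lst = [random.uniform(1., 10.) for _ in range(20)]
print(len(mindist_subset(lst, 1.)))  # expected: 6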

How to apply a backtracking algorithm?

I'm doing some exercises in a Python course, and the one where I'm stuck is below:
Given a digit sequence that represents a message where each uppercase letter
is replaced with a number (A - 1, B - 2, ... , Z - 26) and space - 0.
Find the number of the initial messages, from which that sequence
could be obtained.
Example: 12345 - 3 (ABCDE, LCDE, AWDE)
11 - 2 (AA, K)
The naive solution is easy, and it is a simple brute-force algorithm:
import string

def count_init_messages(sequence):
    def get_alpha(seq):
        nonlocal count
        if len(seq) == 0:
            count += 1
            return
        for i in range(1, len(seq) + 1):
            if seq[:i] not in alph_table:
                break
            else:
                get_alpha(seq[i:])

    alphabet = " " + string.ascii_uppercase
    # generate dictionary of possible digit combination
    alph_table = {str(n): alph for n, alph in zip(range(len(alphabet)), alphabet)}
    # counter for the right combination met
    count = 0

    get_alpha(sequence)
    return count

def main():
    sequence = input().rstrip()
    print(count_init_messages(sequence))

if __name__ == "__main__":
    main()
But as an input sequence can be up to 100 characters long and may contain lots of repetition, I hit the time limit. For example, one of the sample inputs is 2222222222222222222222222222222222222222222222222222222222222222222222 (the number of possible messages is 308061521170129). As my implementation makes so many repeated calls, it takes ages to process such an input. I thought of using a backtracking algorithm, but I haven't yet figured out how to implement memoization for the successive results.
I'd be glad if someone could point me in the right direction on how to break down this task.
The recurrence relation you have to solve (where s is a string of digits, and a and b are single digits) is this:
S("") = 1
S(a) = 1
S(s + a + b) = S(s+a) + (S(s) if ab is between 10 and 26)
That can be computed using dynamic programming rather than backtracking. If you do it right, it's O(n) time complexity, and O(1) space complexity.
def seq(s):
    a1, a2 = 1, 1
    for i in xrange(1, len(s)):
        a1, a2 = a1 + (a2 if 9 < int(s[i-1:i+1]) < 27 else 0), a1
    return a1

print seq('2222222222222222222222222222222222222222222222222222222222222222222222')
The largest number in the lookup table is 26, so you never need to look up strings of length greater than 2. Modify the for loop accordingly. That might be enough to make brute force viable.
You may have also recognized 308061521170129 as the 71st Fibonacci number. This relationship corresponds with the Fibonacci numbers giving "the solution to certain enumerative problems. The most common such problem is that of counting the number of compositions of 1s and 2s that sum to a given total n: there are Fn+1 ways to do this" (https://en.wikipedia.org/wiki/Fibonacci_number#Use_in_mathematics).
Every contiguous subsequence in the string that can be divided into either single or double digit codes represents such an n with multiple possible compositions of 1s and 2s; and thus, for every such subsequence within the string, the result must be multiplied by the (subsequence's length + 1)-th Fibonacci number (in the case of the 70 2's, we just multiply 1 by the 71st Fibonacci number).
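As a quick sanity check of that relationship (an added illustration, not part of the original answer), the 71st Fibonacci number can be computed directly and compared with the count quoted above:

# Iteratively compute Fibonacci numbers (F(1) = F(2) = 1) and confirm that
# F(71) matches the message count quoted for the run of seventy 2's.
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(71))                      # 308061521170129
print(fib(71) == 308061521170129)   # True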

Improving the execution time of matrix calculations in Python

I work with a large amount of data, and the execution time of this piece of code is very important. The results in each iteration are interdependent, so it's hard to parallelize. It would be awesome if there were a faster way to implement some parts of this code, like:
finding the max element in the matrix and its indices
changing the values in a row/column with the max from another row/column
removing a specific row and column
Filling the weights matrix is pretty fast.
The code does the following:
it contains a list of lists of words word_list, with count elements in it. At the beginning each word is a separate list.
it contains a two-dimensional list (count x count) of float values, weights (a lower triangular matrix: the entries on and above the diagonal are zeros)
in each iteration it does the following:
it finds the two words with the most similar value (the max element in the matrix and its indices)
it merges their row and column, saving the larger value from the two in each cell
it merges the corresponding word lists in word_list. It saves both lists in the one with the smaller index (max_j) and it removes the one with the larger index (max_i).
it stops if the largest value is less than a given THRESHOLD
I might consider a different algorithm for this task, but I have no ideas for now, and it would be great if there were at least a small performance improvement.
I tried using NumPy but it performed worse.
weights = fill_matrix(count, N, word_list)

while 1:
    # find the max element in the matrix and its indices
    max_element = 0
    for i in range(count):
        max_e = max(weights[i])
        if max_e > max_element:
            max_element = max_e
            max_i = i
            max_j = weights[i].index(max_e)

    if max_element < THRESHOLD:
        break

    # reset the value of the max element
    weights[max_i][max_j] = 0

    # here it is important that max_j is always less than max_i (since it's a lower triangular matrix)
    for j in range(count):
        weights[max_j][j] = max(weights[max_i][j], weights[max_j][j])
    for i in range(count):
        weights[i][max_j] = max(weights[i][max_j], weights[i][max_i])

    # compare the symmetrical elements, set the ones above to 0
    for i in range(count):
        for j in range(count):
            if i <= j:
                if weights[i][j] > weights[j][i]:
                    weights[j][i] = weights[i][j]
                    weights[i][j] = 0

    # remove the max_i-th column
    for i in range(len(weights)):
        weights[i].pop(max_i)
    # remove the max_i-th row
    weights.pop(max_i)

    new_list = word_list[max_j]
    new_list += word_list[max_i]
    word_list[max_j] = new_list
    # remove the element that was recently merged into a cluster
    word_list.pop(max_i)

    count -= 1
This might help:
def max_ij(A):
    t1 = [max(list(enumerate(row)), key=lambda r: r[1]) for row in A]
    t2 = max(list(enumerate(t1)), key=lambda r: r[1][1])
    i, (j, max_) = t2
    return max_, i, j
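For instance (an added usage sketch, not part of the original answer), on a small lower-triangular matrix it returns the largest value together with its row and column indices, which is exactly what the max-finding loop in the question computes:

# Hypothetical usage: find the largest value and its (row, column) position.
weights = [[0.0, 0.0, 0.0],
           [0.2, 0.0, 0.0],
           [0.7, 0.4, 0.0]]
max_element, max_i, max_j = max_ij(weights)
print(max_element, max_i, max_j)  # 0.7 2 0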
It depends on how much work you want to put into it but if you're really concerned about speed you should look into Cython. The quick start tutorial gives a few examples ranging from a 35% speedup to an amazing 150x speedup (with some added effort on your part).

longest common sub-string, Python complexity analysis

I built a function which finds the longest common substring of two text files, checking window sizes in ascending order, based on the Rabin–Karp algorithm.
the main function is "find_longest" and the inner functions are: "make_hashtable","extend_fingerprints" and "has_match".
I'm having trouble analyzing the average case complexity of has_match.
Denote by n1, n2 the lengths of text1, text2, and by l the size of the current "window".
fingers1, fingers2 are the fingerprint lists of the substrings.
def has_match(text1, text2, fingers1, fingers2, l, r):
    h = make_hashtable(fingers2, r)
    for i in range(len(fingers1)):
        for j in h[fingers1[i]]:
            if text1[i:i+l] == text2[j:j+l]:
                return text1[i:i+l]
    return None
this is "make_hashtable", here I'm pretty sure that the complexcity is O(n2-l+1):
def make_hashtable(fingers, table_size):
    hash_table = [[] for i in range(table_size)]
    count = 0
    for f in fingers:
        hash_table[f].append(count)
        count += 1
    return hash_table
this is "find_longest", im adding this function despite the fact that i dont need it for the complexity analyzing.
def find_longest(text1, text2, basis=2**8, r=2**17-1):
    match = ''
    l = 0 #initial "window" size
    #fingerprints of "windows" of size 0 - all are 0
    fingers1 = [0]*(len(text1)+1)
    fingers2 = [0]*(len(text2)+1)
    while match != None: #there was a common substring of len l
        l += 1
        extend_fingerprints(text1, fingers1, l, basis, r)
        extend_fingerprints(text2, fingers2, l, basis, r)
        match = has_match(text1, text2, fingers1, fingers2, l, r)
        print(match)
    return l-1
and this is "extend_fingerprints":
def extend_fingerprints(text, fingers, length, basis=2**8, r=2**17-1):
    count = 0
    for f in fingers:
        if count == len(fingers)-1:
            fingers.pop(len(fingers)-1)
            break
        fingers[count] = (f*basis + ord(text[length-1+count])) % r
        count += 1
I'm torn between these two options:
1. O(n_2-l+1) + O(n_1-l+1)*O(l)
Treat r as a constant while n1, n2 are very large, so a lot of collisions occur in the hash table (say, O(1) items in every 'cell', but always some "false positives").
2. O(n_2-l+1) + O(n_1-l+1) + O(l)
Treat r as optimal for a decent hash function, so there are almost no collisions, which means that if two substrings land in the same cell of the hash table we may assume they are actually the same text?
Personally I lean towards the bold statement.
Thanks.
I think the answer is O((n_2-l) + l*(n_1-l)).
(n_2-l) represents the complexity of make_hashtable for the second text.
l*(n_1-l) represents the two nested loops that go through every item in the fingerprints of the first text and perform one comparison operation (on a slice of length l) for each of some constant number m of items that share the same index in the hash table.
