I have a set of attributes A= {a1, a2, ...an} and a set of clusters C = {c1, c2, ... ck} and I have a set of correspondences COR which is a subset of A x C and |COR|<< A x C. Here is a sample set of correspondences
COR = {(a1, c1), (a1, c2), (a2, c1), (a3, c3), (a4, c4)}
Now, I want to generate all the subsets of COR such that each pair in the subset represents an injective function from set A to set C. Let's call each of such subset a mapping then the valid mappings from the above set COR would be
m1 = {(a1, c1), (a3, c3), (a4, c4)} and m2 = {(a1, c2), (a2, c1), (a3, c3), (a4, c4)}
m1 is interesting here because adding any of the remaining elements from COR to m1 would either violate the definition of the function or it would violate the condition of being an injective function. For instance, if we add the pair (a1,c2) to m1, m1 would not be a function anymore and if we add (a2,c1) to m1, it will cease to be an injective function. So, I am interested in some code snippets or algorithm that I can use to generate all such mappings. Here is what I have tried so far in python
import collections
import itertools
corr = set({('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')})
clusters = [c[1] for c in corr]
attribs = [a[0] for a in corr]
rep_clusters = [item for item, count in collections.Counter(clusters).items() if count>1]
rep_attribs = [item for item, count in collections.Counter(attribs).items() if count>1]
conflicting_sets = []
for c in rep_clusters:
conflicting_sets.append([p for p in corr if p[1] == c])
for a in rep_attribs:
conflicting_sets.append([p for p in corr if p[0] == a])
non_conflicting = corr
for s in conflicting_sets:
non_conflicting = non_conflicting - set(s)
m = set()
for p in itertools.product(*conflicting_sets):
print(p, 'product', len(p))
p_attribs = set([k[0] for k in p])
p_clusters = set([k[1] for k in p])
print(len(p_attribs), len(p_clusters))
if len(p) == len(p_attribs) and len(p) == len(p_clusters):
m.add(frozenset(set(p).union(non_conflicting)))
print(m)
And as expected the code produces m2 but not m1 because m1 will not be generated from itertools.product. Can anyone guide me on this? I would also like some guidance on performance because the actual sets would be larger than COR set used here and may contain many more conflicting sets.
A simpler definition of your requirements is:
You have a set of unique tuples.
You want to generate all subsets for which:
all of the first elements of the tuples are unique (to ensure a function);
and all of the second elements are unique (to ensure injectivity).
Your title suggests you only want the maximal subsets, i.e. it must be impossible to add any additional elements from the original set without breaking the other requirements.
I'm also assuming any a<x> or c<y> is unique.
Here's a solution:
def get_maximal_subsets(corr):
def is_injective_function(f):
if not f:
return False
f_domain, f_range = zip(*f)
return len(set(f_domain)) - len(f_domain) + len(set(f_range)) - len(f_range) == 0
def generate_from(f):
if is_injective_function(f):
for r in corr - f:
if is_injective_function(f | {r}):
break
else:
yield f
else:
for c in f:
yield from generate_from(f - {c})
return list(map(set, set(map(frozenset, generate_from(corr)))))
# representing a's and c's as strings, as their actual value doesn't matter, as long as they are unique
print(get_maximal_subsets(corr={('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')}))
The test is_injective_function checks if the provided set f represents a valid injective function, by getting all the values from the domain and range of the function and checking that both only contain unique values.
The generator takes an f, and if it represents an injective valid function, it checks to see that none of the elements that have been removed from the original corr to reach f can be added back in while still having it represent an injective valid function. If that's the case, it yields f as a valid result.
If f isn't an injective valid function to begin with, it will try to remove each of the elements in f in turn and generate any injective valid functions from each of those subsets.
Finally, the whole function removes duplicates from the resulting generator and returns it as a list of unique sets.
Output:
[{('a1', 'c1'), ('a3', 'c3'), ('a4', 'c4')}, {('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4'), ('a1', 'c2')}]
Note, there's several approaches to deduplicating a list of non-hashable values, but this approach turns all the sets in the list into a frozenset to make them hashable, then turns the list into a set to remove duplicates, then turns the contents into sets again and returns the result as a list.
You can prevent removing duplicates at the end by keeping track of what removed subsets have already been tried, which may perform better depending on your actual data set:
def get_maximal_subsets(corr):
def is_injective_function(f):
if not f:
return False
f_domain, f_range = zip(*f)
return len(set(f_domain)) - len(f_domain) + len(set(f_range)) - len(f_range) == 0
previously_removed = []
def generate_from(f, removed: set = None):
previously_removed.append(removed)
if removed is None:
removed = set()
if is_injective_function(f):
for r in removed:
if is_injective_function(f | {r}):
break
else:
yield f
else:
for c in f:
if removed | {c} not in previously_removed:
yield from generate_from(f - {c}, removed | {c})
return list(generate_from(corr))
This is probably a generally better performing solution, but I liked the clean algorithm of the first one better for explanation.
I was annoyed by the slowness of the above solution after the comment asking whether it scales up to 100 elements with ~15 conflicts (it would run for many minutes to solve it), so here's a faster solution that runs under 1 second for 100 elements with 15 conflicts, although the execution time still goes up exponentially, so it has its limits):
def injective_function_conflicts(f):
if not f:
return {}
conflicts = defaultdict(set)
# loop over the product f x f
for x in f:
for y in f:
# for each x and y that have a conflict in any position
if x != y and any(a == b for a, b in zip(x, y)):
# add x to y's entry and y to x's entry
conflicts[y].add(x)
conflicts[x].add(y)
return conflicts
def get_maximal_partial_subsets(conflicts, off_limits: set = None):
if off_limits is None:
off_limits = set()
while True and conflicts:
# pop elements from the conflicts, using them now, or discarding them if off-limits
k, vs = conflicts.popitem()
if k not in off_limits:
break
else:
# nothing left in conflicts that's not off-limits
yield set()
return
# generate each possible result from the rest of the conflicts, adding the conflicts vs for k to off_limits
for sub_result in get_maximal_partial_subsets(dict(conflicts), off_limits | vs):
# these results can have k added to them, as all the conflicts with k were off-limits
yield sub_result | {k}
# also generated each possible result from the rest of the conflicts without k's conflicts
for sub_result in get_maximal_partial_subsets(conflicts, off_limits):
# but only yield as a result if adding k itself to it would actually cause a conflict, avoiding duplicates
if sub_result and injective_function_conflicts(sub_result | {k}):
yield sub_result
def efficient_get_maximal_subsets(corr):
conflicts = injective_function_conflicts(corr)
final_result = list((corr - set(conflicts.keys())) | result
for result in get_maximal_partial_subsets(dict(conflicts)))
print(f'size of result and conflict: {len(final_result)}, {len(conflicts)}')
return final_result
print(efficient_get_maximal_subsets(corr={('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')}))
Related
I just started to use list comprehension and I'm struggling with it. In this case, I need to get the n number of each list (sequence_0 and sequence_1) that the iteration is at each time. How can I do that?
The idea is to get the longest sequence of equal nucleotides (a motif) between the two sequences. Once a pair is finded, the program should continue in the nexts nucleotides of the sequences, checking if they are also equal and then elonganting the motif with it. The final output should be an list of all the motifs finded.
The problem is, to continue in the next nucleotides once a pair is finded, i need the position of the pair in both sequences to the program continue. The index function does not work in this case, and that's why i need the enumerate.
Also, I don't understand exactly the reason for the x and y between (), it would be good to understand that too :)
just to explain, the content of the lists is DNA sequences, so its basically something like:
sequence_1 = ['A', 'T', 'C', 'A', 'C']
def find_shared_motif(arq):
data = fastaread(arq)
seqs = [list(sequence) for sequence in data.values()]
motifs = [[]]
i = 0
sequence_0, sequence_1 = seqs[0], seqs[1] # just to simplify
for x, y in [(x, y) for x in zip(sequence_0[::], sequence_0[1::]) for y in zip(sequence_1[::], sequence_1[1::])]:
print(f'Pairs {"".join(x)} and {"".join(y)} being analyzed...')
if x == y:
print(f'Pairs {"".join(x)} and {"".join(y)} match!')
motifs[i].append(x[0]), motifs[i].append(x[1])
k = sequence_0.index(x[0]) + 2 # NAO ESTA DEVOLVENDO O NUMERO CERTO
u = sequence_1.index(y[0]) + 2
print(k, u)
# Determines if the rest of the sequence is compatible
print(f'Starting to elongate the motif {x}...')
for j, m in enumerate(sequence_1[u::]):
try:
# Checks if the nucleotide is equal for both of the sequences
print(f'Analyzing the pair {sequence_0[k + j]}, {m}')
if m == sequence_0[k + j]:
motifs[i].append(m)
print(f'The pair {sequence_0[k + j]}, {m} is equal!')
# Stop in the first nonequal residue
else:
print(f'The pair {sequence_0[k + j]}, {m} is not equal.')
break
except IndexError:
print('IndexError, end of the string')
else:
i += 1
motifs.append([])
return motifs
...
One way to go with it is to start zipping both lists:
a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
c = list(zip(a,b))
In that case, c will have the list of tuples below
c = [('A','A'), ('T','T'), ('C','C'), ('A','C'), ('C','T')]
Then, you can go with list comprehension and enumerate:
d = [(i, t) for i, t in enumerate(c)]
This will bring something like this to you:
d = [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Of course you can go for a one-liner, if you want:
d = [(i, t) for i, t in enumerate(zip(a,b))]
>>> [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Now, you have to deal with the nested tuples. Focus on the internal ones. It is obvious that what you want is to compare the first element of the tuples with the second ones. But, also, you will need the position where the difference resides (that lies outside). So, let's build a function for it. Inside the function, i will capture the positions, and t will capture the inner tuples:
def compare(a, b):
d = [(i, t) for i, t in enumerate(zip(a,b))]
for i, t in d:
if t[0] != t[1]:
return i
return -1
In that way, if you get -1 at the end, it means that all elements in both lists are equal, side by side. Otherwise, you will get the position of the first difference between them.
It is important to notice that, in the case of two lists with different sizes, the zip function will bring a list of tuples with the size matching the smaller of the lists. The extra elements of the other list will be ignored.
Ex.
list(zip([1,2], [3,4,5]))
>>> [(1,3), (2,4)]
You can use the function compare with your code to get the positions where the lists differ, and use that to build your motifs.
dict = {'a':['b1','b2', 'b3'], 'b':['b1','b2','b3'], 'c':['b1','b3','b4','b5']}
toList = list(dict.values())
os.path.commonprefix(toList)
os.path.commonprefix(toList) prints just ['b1'] but I'm trying to find the longest common prefix amongst any of the list of lists inputted, so ['b1', 'b2'] here. Another example:
[a,b,c],[a,c,c],[a,b] -> [a,b]
[a,c,d],[a,b,c],[a,d] -> [a]
*EDITED ORIGINAL QUESTION - realized os.path.commonprefix(toList) doesn't return any existing common prefix (like in my example), but the common prefix of all given lists inputted. Is there a library that does do what I want in my example?
You could build on your initial approach by adding itertools into the mix to find the commonprefix of all list combinations and then use max() to return the longest one. Note that this approach will only return one commonprefix so if there are multiple results that are of equal length and longer than all others it will only return one of them.
For example:
import itertools
import os
data = [['a','b','c'],['a','c','c'],['a','b']]
prefixes = [os.path.commonprefix([a, b]) for a, b in itertools.combinations(data, 2)]
longest = max(prefixes, key=len)
print(longest)
# OUTPUT
# ['a', 'b']
Made changes to the answer posted here
from itertools import takewhile,izip
x = [['b1', 'b2', 'b3'], ['b1', 'b3', 'b4', 'b5'], ['b1', 'b2', 'b3']]
flag_to_stop=False # flag to stop returning True
def allsame(x):
global flag_to_stop
if flag_to_stop:
return False
elif len(set(x)) == 1:
return True
elif len(set(x))>=1 and len(set(x))<len(x):
flag_to_stop=True #we have found the maximum common prefix. set flag_to_stop to True
return True
elif len(set(x))==len(x):
flag_to_stop=True
return False
[i[0] for i in takewhile(allsame ,izip(*x))]
Output
['b1', 'b2']
EDIT:
Say I have a string of nested parentheses as follows: ((AB)CD(E(FG)HI((J(K))L))) (assume the parantheses are balanced and enclose dproperly
How do i recursively remove the first set of fully ENCLOSED parantheses of every subset of fully enclosed parentheses?
So in this case would be (ABCD(E(FG)HI(JK)). (AB) would become AB because (AB) is the first set of closed parentheses in a set of closed parentheses (from (AB) to K)), E is also the first element of a set of parentheses but since it doesn't have parentheses nothing is changed, and (J) is the first element in the set ((J)K) and therefore the parentheses would be removed.
This is similar to building an expression tree and so far I have parsed it into a nested list and thought I can recursively check if the first element of every nested list isinstance(type(list)) but I don't know how?
The nested list is as follows:
arr = [['A', 'B'], 'C', 'D', ['E', ['F', 'G'], 'H', 'I', [['J'], 'K']]]
Perhaps convert it into:
arr = [A, B, C, D, [E, [F, G], H, I, [J, K]]
Is there a better way?
If I understood the question correctly, this ugly function should do the trick:
def rm_parens(s):
s2 = []
consec_parens = 0
inside_nested = False
for c in s:
if c == ')' and inside_nested:
inside_nested = False
consec_parens = 0
continue
if c == '(':
consec_parens += 1
else:
consec_parens = 0
if consec_parens == 2:
inside_nested = True
else:
s2.append(c)
s2 = ''.join(s2)
if s2 == s:
return s2
return rm_parens(s2)
s = '((AB)CD(E(FG)HI((J)K))'
s = rm_parens(s)
print(s)
Note that this function will call itself recursively until no consecutive parentheses exist. However, in your example, ((AB)CD(E(FG)HI((J)K)), a single call is enough to produce (ABCD(E(FG)HI(JK)).
You need to reduce your logic to something clear enough to program. What I'm getting from your explanations would look like the code below. Note that I haven't dealt with edge cases: you'll need to check for None elements, hitting the end of the list, etc.
def simplfy(parse_list):
# Find the first included list;
# build a new list with that set of brackets removed.
reduce_list = []
for pos in len(parse_list):
elem = parse_list[pos]
if isinstance(elem, list):
# remove brackets; construct new list
reduce_list = parse_list[:pos]
reduce_list.extend(elem)
reduce_list.extend(parse_list[pos+1:]
# Recur on each list element
return_list = []
for elem in parse_list
if isinstance(elem, list):
return_list.append(simplfy(elem))
else:
return_list.append(elem)
return return_list
I need to iterate a tree/graph and produce a certain output but following some rules:
_ d
/ / \
b c _e
/ / |
a f g
The expected output should be (order irrelevant):
{'bde', 'bcde', 'abde', 'abcde', 'bdfe', 'bdfge', 'abdfe', ...}
The rules are:
The top of the tree 'bde' (leftmost_root_children+root+rightmost_root_children) should always be present
The left-right order should be preserved so for example the combinations 'cb' or 'gf' are not allowed.
All paths follow the left to right direction.
I need to find all paths following these rules. Unfortunately I don't have a CS background and my head is exploding. Any tip will be helpful.
EDIT: This structure represents my tree very closely:
class N():
"""Node"""
def __init__(self, name, lefts, rights):
self.name = name
self.lefts = lefts
self.rights = rights
tree = N('d', [N('b', [N('a', [], [])], []), N('c', [], [])],
[N('e', [N('f', [], []), N('g', [], [])],
[])])
or may be more readable:
N('d', lefts =[N('b', lefts=[N('a', [], [])], rights=[]), N('c', [], [])],
rights=[N('e', lefts=[N('f', [], []), N('g', [], [])], rights=[])])
So this can be treated as a combination of two problems. My code below will assume the N class and tree structure have already been defined as in your problem statement.
First: given a tree structure like yours, how do you produce an in-order traversal of its nodes? This is a pretty straightforward problem, so I'll just show a simple recursive generator that solves it:
def inorder(node):
if not isinstance(node, list):
node = [node]
for n in node:
for left in inorder(getattr(n, 'lefts', [])):
yield left
yield n.name
for right in inorder(getattr(n, 'rights', [])):
yield right
print list(inorder(tree))
# ['a', 'b', 'c', 'd', 'f', 'g', 'e']
Second: Now that we have the "correct" ordering of the nodes, we next need to figure out all possible combinations of these that a) maintain this order, and b) contain the three "anchor" elements ('b', 'd', 'e'). This we can accomplish using some help from the always-handy itertools library.
The basic steps are:
Identify the anchor elements and partition the list into four pieces around them
Figure out all combinations of elements for each partition (i.e. the power set)
Take the product of all such combinations
Like so:
from itertools import chain, combinations
# powerset recipe taken from itertools documentation
def powerset(iterable):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
def traversals(tree):
left, mid, right = tree.lefts[0].name, tree.name, tree.rights[0].name
nodes = list(inorder(tree))
l_i, m_i, r_i = [nodes.index(x) for x in (left, mid, right)]
parts = nodes[:l_i], nodes[l_i+1:m_i], nodes[m_i+1:r_i], nodes[r_i+1:]
psets = [powerset(x) for x in parts]
for p1, p2, p3, p4 in product(*psets):
yield ''.join(chain(p1, left, p2, mid, p3, right, p4))
print list(traversals(tree))
# ['bde', 'bdfe', 'bdge', 'bdfge', 'bcde', 'bcdfe',
# 'bcdge', 'bcdfge', 'abde', 'abdfe', 'abdge', 'abdfge',
# 'abcde', 'abcdfe', 'abcdge', 'abcdfge']
I have a list of tuples where each tuple is a (start-time, end-time). I am trying to merge all overlapping time ranges and return a list of distinct time ranges.
For example
[(1, 5), (2, 4), (3, 6)] ---> [(1,6)]
[(1, 3), (2, 4), (5, 8)] ---> [(1, 4), (5,8)]
Here is how I implemented it.
# Algorithm
# initialranges: [(a,b), (c,d), (e,f), ...]
# First we sort each tuple then whole list.
# This will ensure that a<b, c<d, e<f ... and a < c < e ...
# BUT the order of b, d, f ... is still random
# Now we have only 3 possibilities
#================================================
# b<c<d: a-------b Ans: [(a,b),(c,d)]
# c---d
# c<=b<d: a-------b Ans: [(a,d)]
# c---d
# c<d<b: a-------b Ans: [(a,b)]
# c---d
#================================================
def mergeoverlapping(initialranges):
i = sorted(set([tuple(sorted(x)) for x in initialranges]))
# initialize final ranges to [(a,b)]
f = [i[0]]
for c, d in i[1:]:
a, b = f[-1]
if c<=b<d:
f[-1] = a, d
elif b<c<d:
f.append((c,d))
else:
# else case included for clarity. Since
# we already sorted the tuples and the list
# only remaining possibility is c<d<b
# in which case we can silently pass
pass
return f
I am trying to figure out if
Is the a an built-in function in some python module that can do this more efficiently? or
Is there a more pythonic way of accomplishing the same goal?
Your help is appreciated. Thanks!
A few ways to make it more efficient, Pythonic:
Eliminate the set() construction, since the algorithm should prune out duplicates during in the main loop.
If you just need to iterate over the results, use yield to generate the values.
Reduce construction of intermediate objects, for example: move the tuple() call to the point where the final values are produced, saving you from having to construct and throw away extra tuples, and reuse a list saved for storing the current time range for comparison.
Code:
def merge(times):
saved = list(times[0])
for st, en in sorted([sorted(t) for t in times]):
if st <= saved[1]:
saved[1] = max(saved[1], en)
else:
yield tuple(saved)
saved[0] = st
saved[1] = en
yield tuple(saved)
data = [
[(1, 5), (2, 4), (3, 6)],
[(1, 3), (2, 4), (5, 8)]
]
for times in data:
print list(merge(times))
Sort tuples then list, if t1.right>=t2.left => merge
and restart with the new list, ...
-->
def f(l, sort = True):
if sort:
sl = sorted(tuple(sorted(i)) for i in l)
else:
sl = l
if len(sl) > 1:
if sl[0][1] >= sl[1][0]:
sl[0] = (sl[0][0], sl[1][1])
del sl[1]
if len(sl) < len(l):
return f(sl, False)
return sl
The sort part: use standard sorting, it compares tuples the right way already.
sorted_tuples = sorted(initial_ranges)
The merge part. It eliminates duplicate ranges, too, so no need for a set. Suppose you have current_tuple and next_tuple.
c_start, c_end = current_tuple
n_start, n_end = next_tuple
if n_start <= c_end:
merged_tuple = min(c_start, n_start), max(c_end, n_end)
I hope the logic is clear enough.
To peek next tuple, you can use indexed access to sorted tuples; it's a wholly known sequence anyway.
Sort all boundaries then take all pairs where a boundary end is followed by a boundary start.
def mergeOverlapping(initialranges):
def allBoundaries():
for r in initialranges:
yield r[0], True
yield r[1], False
def getBoundaries(boundaries):
yield boundaries[0][0]
for i in range(1, len(boundaries) - 1):
if not boundaries[i][1] and boundaries[i + 1][1]:
yield boundaries[i][0]
yield boundaries[i + 1][0]
yield boundaries[-1][0]
return getBoundaries(sorted(allBoundaries()))
Hm, not that beautiful but was fun to write at least!
EDIT: Years later, after an upvote, I realised my code was wrong! This is the new version just for fun:
def mergeOverlapping(initialRanges):
def allBoundaries():
for r in initialRanges:
yield r[0], -1
yield r[1], 1
def getBoundaries(boundaries):
openrange = 0
for value, boundary in boundaries:
if not openrange:
yield value
openrange += boundary
if not openrange:
yield value
def outputAsRanges(b):
while b:
yield (b.next(), b.next())
return outputAsRanges(getBoundaries(sorted(allBoundaries())))
Basically I mark the boundaries with -1 or 1 and then sort them by value and only output them when the balance between open and closed braces is zero.
Late, but might help someone looking for this. I had a similar problem but with dictionaries. Given a list of time ranges, I wanted to find overlaps and merge them when possible. A little modification to #samplebias answer led me to this:
Merge function:
def merge_range(ranges: list, start_key: str, end_key: str):
ranges = sorted(ranges, key=lambda x: x[start_key])
saved = dict(ranges[0])
for range_set in sorted(ranges, key=lambda x: x[start_key]):
if range_set[start_key] <= saved[end_key]:
saved[end_key] = max(saved[end_key], range_set[end_key])
else:
yield dict(saved)
saved[start_key] = range_set[start_key]
saved[end_key] = range_set[end_key]
yield dict(saved)
Data:
data = [
{'start_time': '09:00:00', 'end_time': '11:30:00'},
{'start_time': '15:00:00', 'end_time': '15:30:00'},
{'start_time': '11:00:00', 'end_time': '14:30:00'},
{'start_time': '09:30:00', 'end_time': '14:00:00'}
]
Execution:
print(list(merge_range(ranges=data, start_key='start_time', end_key='end_time')))
Output:
[
{'start_time': '09:00:00', 'end_time': '14:30:00'},
{'start_time': '15:00:00', 'end_time': '15:30:00'}
]
When using Python 3.7, following the suggestion given by “RuntimeError: generator raised StopIteration” every time I try to run app, the method outputAsRanges from #UncleZeiv should be:
def outputAsRanges(b):
while b:
try:
yield (next(b), next(b))
except StopIteration:
return