I need to iterate over a tree/graph and produce a certain output while following some rules:
    _ d
   / / \
  b c   e
 /     / |
a     f  g
The expected output should be (order irrelevant):
{'bde', 'bcde', 'abde', 'abcde', 'bdfe', 'bdfge', 'abdfe', ...}
The rules are:
The top of the tree, 'bde' (leftmost left child + root + rightmost right child), must always be present.
The left-to-right order must be preserved, so for example the combinations 'cb' or 'gf' are not allowed.
All paths follow the left-to-right direction.
I need to find all paths following these rules. Unfortunately I don't have a CS background and my head is exploding. Any tip will be helpful.
EDIT: This structure represents my tree very closely:
class N():
    """Node"""
    def __init__(self, name, lefts, rights):
        self.name = name
        self.lefts = lefts
        self.rights = rights

tree = N('d', [N('b', [N('a', [], [])], []), N('c', [], [])],
         [N('e', [N('f', [], []), N('g', [], [])], [])])
or may be more readable:
N('d', lefts=[N('b', lefts=[N('a', [], [])], rights=[]), N('c', [], [])],
       rights=[N('e', lefts=[N('f', [], []), N('g', [], [])], rights=[])])
So this can be treated as a combination of two problems. My code below will assume the N class and tree structure have already been defined as in your problem statement.
First: given a tree structure like yours, how do you produce an in-order traversal of its nodes? This is a pretty straightforward problem, so I'll just show a simple recursive generator that solves it:
def inorder(node):
    if not isinstance(node, list):
        node = [node]
    for n in node:
        for left in inorder(getattr(n, 'lefts', [])):
            yield left
        yield n.name
        for right in inorder(getattr(n, 'rights', [])):
            yield right

print(list(inorder(tree)))
# ['a', 'b', 'c', 'd', 'f', 'g', 'e']
Second: Now that we have the "correct" ordering of the nodes, we next need to figure out all possible combinations of these that a) maintain this order, and b) contain the three "anchor" elements ('b', 'd', 'e'). This we can accomplish using some help from the always-handy itertools library.
The basic steps are:
Identify the anchor elements and partition the list into four pieces around them
Figure out all combinations of elements for each partition (i.e. the power set)
Take the product of all such combinations
Like so:
from itertools import chain, combinations, product

# powerset recipe taken from the itertools documentation
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def traversals(tree):
    left, mid, right = tree.lefts[0].name, tree.name, tree.rights[0].name
    nodes = list(inorder(tree))
    l_i, m_i, r_i = [nodes.index(x) for x in (left, mid, right)]
    parts = nodes[:l_i], nodes[l_i+1:m_i], nodes[m_i+1:r_i], nodes[r_i+1:]
    psets = [powerset(x) for x in parts]
    for p1, p2, p3, p4 in product(*psets):
        yield ''.join(chain(p1, left, p2, mid, p3, right, p4))

print(list(traversals(tree)))
# ['bde', 'bdfe', 'bdge', 'bdfge', 'bcde', 'bcdfe',
#  'bcdge', 'bcdfge', 'abde', 'abdfe', 'abdge', 'abdfge',
#  'abcde', 'abcdfe', 'abcdge', 'abcdfge']
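As a quick sanity check on the count (a small sketch using the partitions from the traversal above): the anchors 'b', 'd', 'e' split the node list into ['a'], ['c'], ['f', 'g'] and [], and each partition contributes its full power set, so:

parts = [['a'], ['c'], ['f', 'g'], []]
total = 1
for p in parts:
    total *= 2 ** len(p)  # each partition contributes 2**len(p) subsets
print(total)  # 16, matching the 16 traversals printed above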
Related
I just started to use list comprehensions and I'm struggling with them. In this case, I need to get the index each list (sequence_0 and sequence_1) is at on every iteration. How can I do that?
The idea is to get the longest sequence of equal nucleotides (a motif) shared between the two sequences. Once a matching pair is found, the program should continue with the next nucleotides of the sequences, checking whether they are also equal and, if so, elongating the motif with them. The final output should be a list of all the motifs found.
The problem is that, to continue with the next nucleotides once a pair is found, I need the position of the pair in both sequences. The index function does not work in this case, and that's why I need enumerate.
Also, I don't understand exactly the reason for the x and y between the parentheses; it would be good to understand that too :)
Just to explain, the content of the lists is DNA sequences, so it's basically something like:
sequence_1 = ['A', 'T', 'C', 'A', 'C']
def find_shared_motif(arq):
    data = fastaread(arq)
    seqs = [list(sequence) for sequence in data.values()]
    motifs = [[]]
    i = 0
    sequence_0, sequence_1 = seqs[0], seqs[1]  # just to simplify
    for x, y in [(x, y) for x in zip(sequence_0[::], sequence_0[1::]) for y in zip(sequence_1[::], sequence_1[1::])]:
        print(f'Pairs {"".join(x)} and {"".join(y)} being analyzed...')
        if x == y:
            print(f'Pairs {"".join(x)} and {"".join(y)} match!')
            motifs[i].append(x[0]), motifs[i].append(x[1])
            k = sequence_0.index(x[0]) + 2  # NOT RETURNING THE RIGHT NUMBER
            u = sequence_1.index(y[0]) + 2
            print(k, u)
            # Determines if the rest of the sequence is compatible
            print(f'Starting to elongate the motif {x}...')
            for j, m in enumerate(sequence_1[u::]):
                try:
                    # Checks if the nucleotide is equal for both of the sequences
                    print(f'Analyzing the pair {sequence_0[k + j]}, {m}')
                    if m == sequence_0[k + j]:
                        motifs[i].append(m)
                        print(f'The pair {sequence_0[k + j]}, {m} is equal!')
                    # Stop at the first nonequal residue
                    else:
                        print(f'The pair {sequence_0[k + j]}, {m} is not equal.')
                        break
                except IndexError:
                    print('IndexError, end of the string')
        else:
            i += 1
            motifs.append([])
    return motifs
...
One way to go with it is to start zipping both lists:
a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
c = list(zip(a,b))
In that case, c will have the list of tuples below
c = [('A','A'), ('T','T'), ('C','C'), ('A','C'), ('C','T')]
Then, you can go with list comprehension and enumerate:
d = [(i, t) for i, t in enumerate(c)]
This will bring something like this to you:
d = [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Of course you can go for a one-liner, if you want:
d = [(i, t) for i, t in enumerate(zip(a,b))]
>>> [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Now you have to deal with the nested tuples. Focus on the inner ones: what you want is to compare the first element of each inner tuple with the second. You will also need the position where a difference resides, which lies in the outer tuple. So, let's build a function for it. Inside the function, i will capture the positions, and t will capture the inner tuples:
def compare(a, b):
    d = [(i, t) for i, t in enumerate(zip(a, b))]
    for i, t in d:
        if t[0] != t[1]:
            return i
    return -1
In that way, if you get -1 at the end, it means that all elements in both lists are equal, side by side. Otherwise, you will get the position of the first difference between them.
It is important to notice that, in the case of two lists with different sizes, the zip function will bring a list of tuples with the size matching the smaller of the lists. The extra elements of the other list will be ignored.
Ex.
list(zip([1,2], [3,4,5]))
>>> [(1,3), (2,4)]
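If the extra elements matter for your use case, the standard library's itertools.zip_longest pads the shorter list instead of truncating (just a side option; compare above sticks with plain zip):

from itertools import zip_longest

print(list(zip_longest([1, 2], [3, 4, 5])))
# [(1, 3), (2, 4), (None, 5)] -- the missing slot is filled with None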
You can use the function compare with your code to get the positions where the lists differ, and use that to build your motifs.
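For example, a minimal check with the sample sequences from the question, assuming the compare function defined above:

a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
print(compare(a, b))  # 3 -- first position where the sequences differ
print(compare(a, a))  # -1 -- no differences at all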
I have a set of attributes A= {a1, a2, ...an} and a set of clusters C = {c1, c2, ... ck} and I have a set of correspondences COR which is a subset of A x C and |COR|<< A x C. Here is a sample set of correspondences
COR = {(a1, c1), (a1, c2), (a2, c1), (a3, c3), (a4, c4)}
Now, I want to generate all the subsets of COR such that each pair in the subset represents an injective function from set A to set C. Let's call each such subset a mapping; then the valid mappings from the above set COR would be
m1 = {(a1, c1), (a3, c3), (a4, c4)} and m2 = {(a1, c2), (a2, c1), (a3, c3), (a4, c4)}
m1 is interesting here because adding any of the remaining elements from COR to m1 would either violate the definition of the function or it would violate the condition of being an injective function. For instance, if we add the pair (a1,c2) to m1, m1 would not be a function anymore and if we add (a2,c1) to m1, it will cease to be an injective function. So, I am interested in some code snippets or algorithm that I can use to generate all such mappings. Here is what I have tried so far in python
import collections
import itertools

corr = set({('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')})
clusters = [c[1] for c in corr]
attribs = [a[0] for a in corr]
rep_clusters = [item for item, count in collections.Counter(clusters).items() if count > 1]
rep_attribs = [item for item, count in collections.Counter(attribs).items() if count > 1]
conflicting_sets = []
for c in rep_clusters:
    conflicting_sets.append([p for p in corr if p[1] == c])
for a in rep_attribs:
    conflicting_sets.append([p for p in corr if p[0] == a])
non_conflicting = corr
for s in conflicting_sets:
    non_conflicting = non_conflicting - set(s)
m = set()
for p in itertools.product(*conflicting_sets):
    print(p, 'product', len(p))
    p_attribs = set([k[0] for k in p])
    p_clusters = set([k[1] for k in p])
    print(len(p_attribs), len(p_clusters))
    if len(p) == len(p_attribs) and len(p) == len(p_clusters):
        m.add(frozenset(set(p).union(non_conflicting)))
print(m)
And as expected the code produces m2 but not m1 because m1 will not be generated from itertools.product. Can anyone guide me on this? I would also like some guidance on performance because the actual sets would be larger than COR set used here and may contain many more conflicting sets.
A simpler definition of your requirements is:
You have a set of unique tuples.
You want to generate all subsets for which:
all of the first elements of the tuples are unique (to ensure a function);
and all of the second elements are unique (to ensure injectivity).
Your title suggests you only want the maximal subsets, i.e. it must be impossible to add any additional elements from the original set without breaking the other requirements.
I'm also assuming any a<x> or c<y> is unique.
Here's a solution:
def get_maximal_subsets(corr):
    def is_injective_function(f):
        if not f:
            return False
        f_domain, f_range = zip(*f)
        return len(set(f_domain)) - len(f_domain) + len(set(f_range)) - len(f_range) == 0

    def generate_from(f):
        if is_injective_function(f):
            for r in corr - f:
                if is_injective_function(f | {r}):
                    break
            else:
                yield f
        else:
            for c in f:
                yield from generate_from(f - {c})

    return list(map(set, set(map(frozenset, generate_from(corr)))))

# representing a's and c's as strings, as their actual value doesn't matter, as long as they are unique
print(get_maximal_subsets(corr={('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')}))
The test is_injective_function checks if the provided set f represents a valid injective function, by getting all the values from the domain and range of the function and checking that both only contain unique values.
The generator takes an f and, if it represents a valid injective function, checks that none of the elements removed from the original corr to reach f can be added back while still leaving a valid injective function. If none can, it yields f as a valid result.
If f isn't a valid injective function to begin with, it tries removing each of the elements of f in turn and generates any valid injective functions from each of those subsets.
Finally, the whole function removes duplicates from the resulting generator and returns it as a list of unique sets.
Output:
[{('a1', 'c1'), ('a3', 'c3'), ('a4', 'c4')}, {('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4'), ('a1', 'c2')}]
Note, there are several approaches to deduplicating a list of non-hashable values, but this approach turns all the sets in the list into frozensets to make them hashable, then turns the list into a set to remove duplicates, then turns the contents into sets again and returns the result as a list.
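A minimal illustration of that frozenset round-trip, with made-up sets:

results = [{1, 2}, {2, 1}, {3}]  # {1, 2} appears twice
unique = list(map(set, set(map(frozenset, results))))
print(unique)  # [{1, 2}, {3}] (order may vary) -- duplicates collapsed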
You can prevent removing duplicates at the end by keeping track of what removed subsets have already been tried, which may perform better depending on your actual data set:
def get_maximal_subsets(corr):
    def is_injective_function(f):
        if not f:
            return False
        f_domain, f_range = zip(*f)
        return len(set(f_domain)) - len(f_domain) + len(set(f_range)) - len(f_range) == 0

    previously_removed = []

    def generate_from(f, removed: set = None):
        previously_removed.append(removed)
        if removed is None:
            removed = set()
        if is_injective_function(f):
            for r in removed:
                if is_injective_function(f | {r}):
                    break
            else:
                yield f
        else:
            for c in f:
                if removed | {c} not in previously_removed:
                    yield from generate_from(f - {c}, removed | {c})

    return list(generate_from(corr))
This is probably a generally better performing solution, but I liked the clean algorithm of the first one better for explanation.
I was annoyed by the slowness of the above solution after the comment asking whether it scales up to 100 elements with ~15 conflicts (it would run for many minutes), so here's a faster solution that runs in under 1 second for 100 elements with 15 conflicts. The execution time still goes up exponentially, though, so it has its limits:
from collections import defaultdict

def injective_function_conflicts(f):
    if not f:
        return {}
    conflicts = defaultdict(set)
    # loop over the product f x f
    for x in f:
        for y in f:
            # for each x and y that have a conflict in any position
            if x != y and any(a == b for a, b in zip(x, y)):
                # add x to y's entry and y to x's entry
                conflicts[y].add(x)
                conflicts[x].add(y)
    return conflicts

def get_maximal_partial_subsets(conflicts, off_limits: set = None):
    if off_limits is None:
        off_limits = set()
    while conflicts:
        # pop elements from the conflicts, using them now, or discarding them if off-limits
        k, vs = conflicts.popitem()
        if k not in off_limits:
            break
    else:
        # nothing left in conflicts that's not off-limits
        yield set()
        return
    # generate each possible result from the rest of the conflicts, adding the conflicts vs for k to off_limits
    for sub_result in get_maximal_partial_subsets(dict(conflicts), off_limits | vs):
        # these results can have k added to them, as all the conflicts with k were off-limits
        yield sub_result | {k}
    # also generate each possible result from the rest of the conflicts without k's conflicts
    for sub_result in get_maximal_partial_subsets(conflicts, off_limits):
        # but only yield as a result if adding k itself to it would actually cause a conflict, avoiding duplicates
        if sub_result and injective_function_conflicts(sub_result | {k}):
            yield sub_result

def efficient_get_maximal_subsets(corr):
    conflicts = injective_function_conflicts(corr)
    final_result = list((corr - set(conflicts.keys())) | result
                        for result in get_maximal_partial_subsets(dict(conflicts)))
    print(f'size of result and conflict: {len(final_result)}, {len(conflicts)}')
    return final_result

print(efficient_get_maximal_subsets(corr={('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')}))
I am trying to write Python code that finds restriction enzyme sites within a sequence of DNA. Restriction enzymes cut at specific DNA sequences, but some are not so strict; for example, XmnI cuts this sequence:
GAANNNNTTC
where N can be any nucleotide (A, C, G, or T). If my math is right, that's 4^4 = 256 unique sequences it can cut. I want to make a list of these 256 short sequences, then check each one against a (longer) input DNA sequence. However, I'm having a hard time generating the 256 sequences. Here's what I have so far:
cutsequencequery = "GAANNNNTTC"
Nseq = ["A", "C", "G", "T"]
querylist = []
if "N" in cutsequencequery:
    Nlist = [cutsequencequery.replace("N", t) for t in Nseq]
    for j in list(Nlist):
        querylist.append(j)
for i in querylist:
    print(i)
print(len(querylist))
and here is the output:
GAAAAAATTC
GAACCCCTTC
GAAGGGGTTC
GAATTTTTTC
4
So it's switching each N to either A, C, G, and T, but I think I need another loop (or 3?) to generate all 256 combinations. Is there an efficient way to do this that I'm not seeing?
Maybe you should take a look at Python's itertools library, which includes product, which creates an iterable with every combination of the given iterables. Therefore:
from itertools import product

cutsequencequery = "GAANNNNTTC"
nseq = ["A", "C", "G", "T"]
size = cutsequencequery.count('N')
possibilities = product(*[nseq for i in range(size)])
# = ('A', 'A', 'A', 'A'), ..., ('T', 'T', 'T', 'T')
# len(list(possibilities)) = 256 = 4^4, as expected

s = set()
for n in possibilities:
    print(''.join(n))  # = 'AAAA', ..., 'TTTT'
    new_sequence = cutsequencequery.replace('N' * size, ''.join(n))
    s.add(new_sequence)
    print(new_sequence)  # = 'GAAAAAATTC', ..., 'GAATTTTTTC'
print(len(s))  # 256 unique sequences
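As a side note, product also accepts a repeat keyword argument, so the argument-building list comprehension isn't strictly needed; an equivalent sketch:

from itertools import product

possibilities = product("ACGT", repeat=4)  # the same 256 tuples as above
print(len(list(possibilities)))  # 256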
I want to find all possible combination of the following list:
data = ['a','b','c','d']
I know it looks like a straightforward task that can be achieved by something like the following code:
comb = [c for i in range(1, len(data)+1) for c in combinations(data, i)]
but what I want is actually a way to give each element of the list data two possibilities ('a' or '-a').
An example of the combinations can be ['a','b'] , ['-a','b'], ['a','b','-c'], etc.
without something like the following case of course ['-a','a'].
You could write a generator function that takes a sequence and yields each possible combination of negations. Like this:
import itertools

def negations(seq):
    for prefixes in itertools.product(["", "-"], repeat=len(seq)):
        yield [prefix + value for prefix, value in zip(prefixes, seq)]

print(list(negations(["a", "b", "c"])))
Result (whitespace modified for clarity):
[
[ 'a', 'b', 'c'],
[ 'a', 'b', '-c'],
[ 'a', '-b', 'c'],
[ 'a', '-b', '-c'],
['-a', 'b', 'c'],
['-a', 'b', '-c'],
['-a', '-b', 'c'],
['-a', '-b', '-c']
]
You can integrate this into your existing code with something like
comb = [x for i in range(1, len(data)+1) for c in combinations(data, i) for x in negations(c)]
Once you have the regular combinations generated, you can do a second pass to generate the ones with "negation." I'd think of it like a binary number, with the number of elements in your list being the number of bits. Count from 0b0000 to 0b1111 via 0b0001, 0b0010, etc., and wherever a bit is set, negate that element in the result. This will produce 2^n combinations for each input combination of length n.
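A minimal sketch of that bitmask idea (the helper name negate_by_bitmask is mine, purely for illustration):

from itertools import combinations

def negate_by_bitmask(comb):
    # one bit per element; a set bit negates the corresponding element
    for mask in range(2 ** len(comb)):
        yield ['-' + el if mask & (1 << i) else el for i, el in enumerate(comb)]

data = ['a', 'b', 'c', 'd']
result = [x for i in range(1, len(data) + 1)
          for c in combinations(data, i)
          for x in negate_by_bitmask(c)]
print(result[:6])  # [['a'], ['-a'], ['b'], ['-b'], ['c'], ['-c']]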
Here is a one-liner, but it can be hard to follow:
from itertools import product
comb = [sum(t, []) for t in product(*[([x], ['-' + x], []) for x in data])]
First, map each element of data to the list of things it can become in a result ([x], ['-' + x], or nothing). Then take the product of those (note the * unpacking) to get all possibilities. Finally, flatten each combination with sum.
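A quick check of what the one-liner produces for the question's data (note that, unlike the combinations-based variants, it also yields the empty combination):

from itertools import product

data = ['a', 'b', 'c', 'd']
comb = [sum(t, []) for t in product(*[([x], ['-' + x], []) for x in data])]
print(len(comb))  # 81 == 3**4: each element is kept, negated, or dropped
print(comb[:3])   # [['a', 'b', 'c', 'd'], ['a', 'b', 'c', '-d'], ['a', 'b', 'c']]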
My solution basically has the same idea as John Zwinck's answer. After you have produced the list of all combinations
comb = [c for i in range(1, len(data)+1) for c in combinations(data, i)]
you generate all possible positive/negative combinations for each element of comb. I do this by iterating through the total number of sign combinations, 2**N, treating each number as a binary number where each binary digit stands for the sign of one element. (E.g. a two-element list would have 4 possible combinations, 0 to 3, represented by 0b00 => (+,+), 0b01 => (-,+), 0b10 => (+,-) and 0b11 => (-,-).)
def twocombinations(it):
    sign = lambda c, i: "-" if c & 2**i else ""
    l = list(it)
    if len(l) < 1:
        return
    # for each possible combination, make a tuple with the appropriate
    # sign before each element
    for c in range(2**len(l)):
        yield tuple(sign(c, i) + el for i, el in enumerate(l))
Now we apply this function to every element of comb and flatten the resulting nested iterator:
l = itertools.chain.from_iterable(map(twocombinations, comb))
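A small usage sketch, with a two-element data list to keep the output readable:

import itertools
from itertools import combinations

data = ['a', 'b']
comb = [c for i in range(1, len(data) + 1) for c in combinations(data, i)]
l = itertools.chain.from_iterable(map(twocombinations, comb))
print(list(l))
# [('a',), ('-a',), ('b',), ('-b',),
#  ('a', 'b'), ('-a', 'b'), ('a', '-b'), ('-a', '-b')]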
How do I write a function (a generator) that takes three letters (l1, l2, l3) and three numbers (n1, n2, n3) and gives all the possible combinations in which l1 occurs n1 times, l2 occurs n2 times, and l3 occurs n3 times?
For example:
for i in function('a', 2, 'b', 1, 'c', 0):
    print(i)
gives:
aab
baa
aba
Use itertools.permutations; all you need is a thin wrapper around it:
from itertools import permutations

def l_n_times(l1, n1, l2, n2, l3, n3):
    return permutations(l1*n1 + l2*n2 + l3*n3)
Demo:
>>> for item in set(l_n_times('a', 2, 'b', 1, 'c', 0)):
...     print(''.join(item))
...
baa
aba
aab
permutations already returns a generator, so you don't have to use yield yourself.
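The set() in the demo is doing real work, by the way: permutations treats equal letters at different positions as distinct, so duplicates appear without it. A quick illustration:

from itertools import permutations

print(len(list(permutations('aab'))))  # 6 -- 3! orderings, positions distinct
print(len(set(permutations('aab'))))   # 3 -- duplicate orderings collapsed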
Doesn't seem to me that itertools would help a lot here, though a recursive implementation may look like this:
def combine(l1, n1, l2, n2, l3, n3):
    counters = {l1: n1, l2: n2, l3: n3}  # remaining characters to use
    buf = []  # string under construction

    def recur(depth):
        if not depth:  # we've reached the bottom
            yield ''.join(buf)
            return
        # choosing the next character
        for s, c in counters.items():
            if not c:  # this character is exhausted
                continue
            counters[s] -= 1
            buf.append(s)
            for val in recur(depth - 1):
                # going down recursively
                yield val
            # restore the state before trying the next character
            buf.pop()
            counters[s] += 1

    length = sum(counters.values())
    return recur(length)

for s in combine('a', 2, 'b', 1, 'c', 0):
    print(s)
Let's say you have a data structure like:
letters = {'a': 2, 'b': 1, 'c': 0}
a recursive function would be:
def r(letters, prefix=''):
    for k, v in letters.items():
        if v > 0:
            d = dict(letters)
            d[k] = v - 1
            for val in r(d, prefix + k):
                yield val
    if all(v == 0 for _, v in letters.items()):
        yield prefix
No duplicates, and it does use a generator. Quite heavy compared to a simple itertools call.
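A short usage sketch with the letters dict from above:

for word in r({'a': 2, 'b': 1, 'c': 0}):
    print(word)
# aab
# aba
# baa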
The docs for itertools have this to say;
The code for combinations() can be also expressed as a subsequence of permutations() after filtering entries where the elements are not in sorted order (according to their position in the input pool):
Since we want all combinations with no duplicates, we'll just enforce strict ordering (i.e. only yield values that are greater than the greatest one so far). This would seem to do just that:
import itertools as it

def dfunc(l, n):
    old = ()
    for i in it.permutations(''.join(a * b for a, b in sorted(zip(l, n)))):
        # only yield permutations that are strictly greater than the last one yielded
        if i > old:
            old = i
            yield i
>>> dfunc(['b','c','a'],[1,0,2])
<generator object dfunc at 0x10ba055a0>
>>> list(dfunc(['b','c','a'],[1,0,2]))
[('a', 'a', 'b'), ('a', 'b', 'a'), ('b', 'a', 'a')]