I have a big list of lists of tuples like
actions = [ [('d', 'r'), ... ('c', 'e'),('', 'e')],
[('r', 'e'), ... ('c', 'e'),('d', 'r')],
... ,
[('a', 'b'), ... ('c', 'e'),('c', 'h')]
]
and I want to find the co-occurrences of the tuples.
I have tried the suggestions from this question, but the accepted answer is just too slow. For example, for a list of 1494 lists of tuples, the resulting dictionary has 18225703 entries and took hours to build for 2-tuple co-occurrences. So plain permutation and counting doesn't seem to be the answer, since I have an even bigger list.
I expect the output to extract the groups of two tuples (or three, four, five at most) that co-occur most often. Using the previous list as an example:
('c', 'e'),('d', 'r')
would be a common co-occurrence when searching for pairs, since they co-occur frequently. Is there an efficient method to achieve this?
I think there is no hope for a faster algorithm: you have to compute the combinations to count them. However, if there is a threshold of co-occurrences below which you are not interested, you can try to reduce the complexity of the algorithm. In both cases, there is hope for lower space complexity.
Let's take a small example:
>>> actions = [[('d', 'r'), ('c', 'e'),('', 'e')],
... [('r', 'e'), ('c', 'e'),('d', 'r')],
... [('a', 'b'), ('c', 'e'),('c', 'h')]]
General answer
This answer is probably the best for a large list of lists, but you can avoid creating intermediary lists. First, create an iterable on all present pairs of elements (elements are pairs too in your case, but that doesn't matter):
>>> import itertools
>>> it = itertools.chain.from_iterable(itertools.combinations(pair_list, 2) for pair_list in actions)
If we want to see the result, we have to consume the iterable:
>>> list(it)
[(('d', 'r'), ('c', 'e')), (('d', 'r'), ('', 'e')), (('c', 'e'), ('', 'e')), (('r', 'e'), ('c', 'e')), (('r', 'e'), ('d', 'r')), (('c', 'e'), ('d', 'r')), (('a', 'b'), ('c', 'e')), (('a', 'b'), ('c', 'h')), (('c', 'e'), ('c', 'h'))]
Then count the sorted pairs (with a fresh it, since the previous one has been consumed!):
>>> it = itertools.chain.from_iterable(itertools.combinations(pair_list, 2) for pair_list in actions)
>>> from collections import Counter
>>> c = Counter((a,b) if a<=b else (b,a) for a,b in it)
>>> c
Counter({(('c', 'e'), ('d', 'r')): 2, (('', 'e'), ('d', 'r')): 1, (('', 'e'), ('c', 'e')): 1, (('c', 'e'), ('r', 'e')): 1, (('d', 'r'), ('r', 'e')): 1, (('a', 'b'), ('c', 'e')): 1, (('a', 'b'), ('c', 'h')): 1, (('c', 'e'), ('c', 'h')): 1})
>>> c.most_common(2)
[((('c', 'e'), ('d', 'r')), 2), ((('', 'e'), ('d', 'r')), 1)]
At least in terms of space, this solution should be efficient, since everything is lazy and the number of entries in the Counter is the number of combinations of elements that share a list. That is at most N(N-1)/2, where N is the number of distinct elements across all the lists ("at most" because some elements never "meet" each other, so some combinations never occur).
The time complexity is O(M * L^2), where M is the number of lists and L the size of the largest list.
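The question also asks about groups of 3, 4 or 5 tuples. The counting idea above can be extended to groups of any size k, as in the following minimal sketch. It assumes the elements within a list are sortable; the helper name count_cooccurrences and its parameter k are illustrative and not part of the original answer. Keep in mind that the number of groups per list grows roughly as L^k, so larger k gets expensive quickly.

```python
import itertools
from collections import Counter

def count_cooccurrences(actions, k=2):
    """Count how often each group of k elements appears together in a list."""
    counter = Counter()
    for pair_list in actions:
        # Deduplicate and sort first, so every group has one canonical order
        # and (a, b) and (b, a) end up under the same key.
        unique_sorted = sorted(set(pair_list))
        counter.update(itertools.combinations(unique_sorted, k))
    return counter

actions = [[('d', 'r'), ('c', 'e'), ('', 'e')],
           [('r', 'e'), ('c', 'e'), ('d', 'r')],
           [('a', 'b'), ('c', 'e'), ('c', 'h')]]

pairs = count_cooccurrences(actions, k=2)
triples = count_cooccurrences(actions, k=3)
print(pairs.most_common(1))  # [((('c', 'e'), ('d', 'r')), 2)]
print(len(triples))          # 3 (one distinct triple per list)
```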
With a threshold on the number of co-occurrences
I assume that all elements in a list are distinct. The key idea is that if an element is present in only one list, it has no chance of winning at this game: it will have 1 co-occurrence with each of its neighbors, and 0 with the elements of other lists. If there are a lot of "orphans", it might be useful to remove them before computing the combinations:
>>> d = Counter(itertools.chain.from_iterable(actions))
>>> d
Counter({('c', 'e'): 3, ('d', 'r'): 2, ('', 'e'): 1, ('r', 'e'): 1, ('a', 'b'): 1, ('c', 'h'): 1})
>>> orphans = set(e for e, c in d.items() if c <= 1)
>>> orphans
{('a', 'b'), ('r', 'e'), ('c', 'h'), ('', 'e')}
Now, try the same algorithm:
>>> it = itertools.chain.from_iterable(itertools.combinations((p for p in pair_list if p not in orphans), 2) for pair_list in actions)
>>> c = Counter((a,b) if a<=b else (b,a) for a,b in it)
>>> c
Counter({(('c', 'e'), ('d', 'r')): 2})
Note the inner comprehension: it uses parentheses, not brackets, so it is a lazy generator expression rather than a list.
If you have K orphans in a list of N elements, the number of combinations to process for that list falls from N(N-1)/2 to (N-K)(N-K-1)/2, that is, K(2N-K-1)/2 fewer combinations.
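A quick numeric check of that saving, as a sketch: the helper name pairs_saved is hypothetical, and math.comb assumes Python 3.8 or later. The difference between the two pair counts, N(N-1)/2 - (N-K)(N-K-1)/2, should match the closed form K(2N-K-1)/2.

```python
from math import comb

def pairs_saved(n, k):
    """Pairs avoided when k of the n elements of a list are dropped."""
    return comb(n, 2) - comb(n - k, 2)

n, k = 10, 4
direct = pairs_saved(n, k)             # 45 - 15
closed_form = k * (2 * n - k - 1) // 2
print(direct, closed_form)             # 30 30
```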
This can be generalized: if an element is present in two or fewer lists, then it will have at most 2 co-occurrences with any other element, and so on.
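The generalization could be sketched as follows. The function name frequent_pairs and the parameter min_count are assumptions made for illustration. Elements present in fewer than min_count lists are dropped before the combinations are built. Since pruning elements is necessary but not sufficient, the resulting pairs are filtered once more at the end.

```python
import itertools
from collections import Counter

def frequent_pairs(actions, min_count=2):
    """Count pairs that co-occur in at least min_count lists."""
    # Number of lists each element appears in (set() guards against repeats).
    presence = Counter(itertools.chain.from_iterable(set(l) for l in actions))
    keep = {e for e, n in presence.items() if n >= min_count}
    pairs = Counter()
    for pair_list in actions:
        kept = sorted(e for e in set(pair_list) if e in keep)
        pairs.update(itertools.combinations(kept, 2))
    # Two frequent elements may still rarely meet, so filter the pairs too.
    return Counter({p: n for p, n in pairs.items() if n >= min_count})

actions = [[('d', 'r'), ('c', 'e'), ('', 'e')],
           [('r', 'e'), ('c', 'e'), ('d', 'r')],
           [('a', 'b'), ('c', 'e'), ('c', 'h')]]

print(frequent_pairs(actions, min_count=2))  # Counter({(('c', 'e'), ('d', 'r')): 2})
print(frequent_pairs(actions, min_count=3))  # Counter()
```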
If this is still too slow, then switch to a faster language.
I'm experimenting with sympy's permutations without replacement
from sympy.functions.combinatorial.numbers import nP
from sympy.utilities.iterables import permutations
nP('abc', 2)
# >>> 6
list(permutations('abc', 2))
# >>> [('a', 'b'), ('a', 'c'), ('b', 'a'), ('b', 'c'), ('c', 'a'), ('c', 'b')]
Next, I want to try permutations with replacement. It seems that there isn't a permutations_with_replacement() method similar to the combinations_with_replacement() method, but there is a variations() method:
from sympy.utilities.iterables import variations
nP('abc', 2, replacement=True)
# >>> 9
list(variations('abc', 2, repetition=True))
# >>>
[('a', 'a'),
('a', 'b'),
('a', 'c'),
('b', 'a'),
('b', 'b'),
('b', 'c'),
('c', 'a'),
('c', 'b'),
('c', 'c')]
Does the variations() method do what I would expect a permutations_with_replacement() method to do?
See also: sympy.utilities.iterables.combinations() with replacement?
The variations method does exactly what you think it does: with repetition=True it computes the Cartesian product, which is also what the aptly named product method of the package does.
This means that list(sympy.utilities.iterables.product('abc', repeat=2)) will yield the same results.
With repetition=False, variations is equal to permutations instead.
This can also be seen from the internal code of variations:
if not repetition:
    seq = tuple(seq)
    if len(seq) < n:
        return
    for i in permutations(seq, n):
        yield i
else:
    if n == 0:
        yield ()
    else:
        for i in product(seq, repeat=n):
            yield i
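To make the branching above concrete without depending on sympy, here is a standalone sketch. It assumes that the permutations and product used in the excerpt behave like their itertools namesakes. The function name variations_sketch is made up for this illustration and is not sympy's implementation.

```python
from itertools import permutations, product

def variations_sketch(seq, n, repetition=False):
    """Illustrative re-statement of the branching shown in the excerpt."""
    if not repetition:
        seq = tuple(seq)
        if len(seq) < n:
            return  # not enough items to pick n without repetition
        yield from permutations(seq, n)
    elif n == 0:
        yield ()
    else:
        yield from product(seq, repeat=n)

with_rep = list(variations_sketch('abc', 2, repetition=True))
without_rep = list(variations_sketch('abc', 2))
print(with_rep == list(product('abc', repeat=2)))   # True
print(without_rep == list(permutations('abc', 2)))  # True
print(len(with_rep), len(without_rep))              # 9 6
```

The counts 9 and 6 agree with the nP values shown in the question.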
I have a list of lines, Lines=([('B', 'C'), ('D', 'A'), ('D', 'C'), ('A', 'B'), ('D', 'B')]), and geometry = ('B', 'C', 'D') is a tuple of points that make up the triangle (B,C,D).
I want to check whether geometry can be built from the lines in Lines. How can I write a function that checks this and returns True or False?
Sample Functionality with input Lines:
>> Lines=([('B', 'C'), ('D', 'A'), ('D', 'C'), ('A', 'B'), ('D', 'B'),])
>> geometry1 = ('B', 'C', 'D')
>> check_geometry(Lines, geometry1)
True
>> geometry2 = ('A', 'B', 'E')
>> check_geometry(Lines, geometry2)
False
This is my code, but the result is wrong:
import itertools
def check_geometry(line, geometry):
    dataE = [set(x) for x in itertools.combinations(geometry, 2)]
    for data in dataE:
        if data not in line:
            return False
    return True
Lines = [('B', 'C'), ('D', 'A'), ('D', 'C'), ('A', 'B'), ('D', 'B'),]
geometry1 = ('B', 'C', 'D')
print(check_geometry(Lines, geometry1))
Output:
False
For triangles:
You could use the built-in all to do this, making sure to first sort the endpoints of each line, since their order might differ from the order generated by itertools.combinations:
sLines = [tuple(sorted(l)) for l in Lines]
dataE = itertools.combinations('BCD', 2)
Now you can call all which will check that every value in dataE is present in sLines:
all(l1 in sLines for l1 in dataE)
Which will return True.
So, your check_geometry function could look something like:
def check_geometry(line, geometry):
    sLines = [tuple(sorted(l)) for l in line]
    dataE = itertools.combinations(geometry, 2)
    return all(l1 in sLines for l1 in dataE)
Calls made will now check if the Lines contain the geometry:
check_geometry(Lines, 'BCD')
# returns True
check_geometry(Lines, 'ABE')
# returns False
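For context, the attempt in the question returns False because it compares set objects against the tuples stored in Lines, and a set never compares equal to a tuple. A variant that avoids both the sorting and the list scans is sketched below. It stores every line as a frozenset, which ignores endpoint order and is hashable. The name check_geometry_sets is hypothetical.

```python
import itertools

def check_geometry_sets(lines, geometry):
    """Return True if every pair of points in geometry is joined by a line."""
    # frozenset({'D', 'C'}) == frozenset({'C', 'D'}), so order no longer matters,
    # and membership tests on a set are constant time on average.
    known = {frozenset(line) for line in lines}
    return all(frozenset(pair) in known
               for pair in itertools.combinations(geometry, 2))

Lines = [('B', 'C'), ('D', 'A'), ('D', 'C'), ('A', 'B'), ('D', 'B')]
print(check_geometry_sets(Lines, ('B', 'C', 'D')))  # True
print(check_geometry_sets(Lines, ('A', 'B', 'E')))  # False
```

Unlike the sorted-tuple version, this does not require the points of geometry to be passed in sorted order.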
A bit more general:
To generalize this a bit, we can drop itertools.combinations and use zip instead. The following version changes the function to accommodate zip but otherwise works in a similar way:
def check_geometry(line, geometry):
    sLines = [sorted(l) for l in line]
    dataE = [sorted(x) for x in zip(geometry, geometry[1:] + geometry[:1])]
    return all(l1 in sLines for l1 in dataE)
The key difference here is:
dataE is now a list of lists containing the result of zip(geometry, geometry[1:] + geometry[:1]). Here zip pairs a string like "BCDA" with the same string rotated left by one position, geometry[1:] + geometry[:1] (i.e. "CDAB"), and so creates one entry per side of the shape:
>>> s = "BCDA"
>>> s[1:] + s[:1]
'CDAB'
>>> list(zip(s, s[1:] + s[:1]))
[('B', 'C'), ('C', 'D'), ('D', 'A'), ('A', 'B')]
Now we can check that a geometry with points "BCDA" can be constructed by the lines in Lines:
check_geometry(Lines, "BCD")
# True
check_geometry(Lines, "BCDA")
# True
check_geometry(Lines, "BCDF")
# False
Note 1: Lines can be written as:
Lines=[('B', 'C'), ('D', 'A'), ('D', 'C'), ('A', 'B'), ('D', 'B')]
The outer parentheses () and the trailing comma , have no effect here, so you can drop them :-)
Note 2: The geometry parameter for check_geometry can be any iterable (tuples, lists, strings):
check_geometry(Lines, "BCD") == check_geometry(Lines, ('B', 'C', 'D'))
Creating and passing a tuple to it seems somewhat odd in this case (though you might have a good reason to do so). Unless something requires it, I would suggest using strings as the value for the geometry parameter.
I think A, B, C can be strings or anything else that identifies a point defining a line.
Okay, I'll be using strings for my answer then. You should be able to adjust the code to your needs.
def check_for_triangle(tri, lines):
    lines_needed = zip(tri, (tri[1], tri[2], tri[0]))
    return all(line in lines or line[::-1] in lines for line in lines_needed)
lines=[('B', 'C'), ('D', 'A'), ('D', 'C'), ('A', 'B'), ('D', 'B')]
tri1 = ('B', 'C', 'D')
tri2 = ('A', 'B', 'E')
print(check_for_triangle(tri1, lines)) # True
print(check_for_triangle(tri2, lines)) # False
The idea is to generate all lines (represented by a pair of points) we need to find in lines for a given triangle with zip. After that, we check whether all these lines can be found in lines.
Checking for line[::-1] as well is needed because the line ('A', 'B') is the same line as ('B', 'A').
I have a list in the form of
[(u'a1', u'b1'),
(u'a1', u'b2'),
(u'c1', u'c2')]
I want it to be split into two lists/columns like
list1      list2
[(u'a1',   [(u'b1'),
(u'a1',    (u'b2'),
(u'c1')]   (u'c2')]
Conversion of unicode to string would also help!
Also, in another case, I have a list in the form of
[(('a', 'c'), -3), (('a', 'd'), -7), (('c', 'd'), -4)]
I need the input in the form of
('a','a','c')
('c','d','d')
(-3,-7,-4)
Any tips?
You could create two new lists using list comprehensions:
x=[(u'a1', u'b1'),
(u'a1', u'b2'),
(u'c1', u'c2')]
list1 = [i[0] for i in x]
list2 = [i[1] for i in x]
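An alternative sketch for the same split uses zip(*x) to transpose the rows into columns in one pass. It assumes Python 3, where every str is already Unicode, so the u prefix is accepted but redundant and no unicode-to-string conversion is needed.

```python
x = [(u'a1', u'b1'),
     (u'a1', u'b2'),
     (u'c1', u'c2')]

# zip(*x) yields one tuple per column; map(list, ...) turns each into a list.
list1, list2 = map(list, zip(*x))
print(list1)  # ['a1', 'a1', 'c1']
print(list2)  # ['b1', 'b2', 'c2']
```

This unpacking only works when every tuple has exactly two fields. The two list comprehensions above are more forgiving if the rows carry extra fields.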
The second example:
>>> L = [(('a', 'c'), -3), (('a', 'd'), -7), (('c', 'd'), -4)]
>>> list(zip(*[(a[0], a[1], b) for a, b in L]))
[('a', 'a', 'c'), ('c', 'd', 'd'), (-3, -7, -4)]
It first flattens each item and then transposes the list.
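The two steps can be written out separately to make them easier to follow. This is only a restatement of the one-liner, and the intermediate names flat and columns are made up for the illustration:

```python
L = [(('a', 'c'), -3), (('a', 'd'), -7), (('c', 'd'), -4)]

# Step 1: flatten each ((x, y), w) item into a flat (x, y, w) triple.
flat = [(x, y, w) for (x, y), w in L]
# Step 2: transpose, so the i-th output tuple collects the i-th field
# of every triple.
columns = list(zip(*flat))

print(flat)     # [('a', 'c', -3), ('a', 'd', -7), ('c', 'd', -4)]
print(columns)  # [('a', 'a', 'c'), ('c', 'd', 'd'), (-3, -7, -4)]
```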
I have a list of tuples:
lst = [('a','b'), ('c', 'b'), ('a', 'd'), ('e','f'), ('a', 'b')]
I want the following output list:
output = [('a','b'), ('e','f')]
i.e. I want to compare the elements of the first tuple with the remaining tuples and remove any tuple that shares one or more elements with it.
My attempt:
I was thinking of using for loops, but that won't be feasible once I have a very large list. I browsed through the following posts but could not find the right solution:
Removing duplicates members from a list of tuples
How do you remove duplicates from a list whilst preserving order?
If somebody could point me in the right direction, it would be very helpful. Thanks!
Assuming that you want "duplicates" of all elements to be suppressed, and not just the first one, you could use:
lst = [('a','b'), ('c', 'b'), ('a', 'd'), ('e','f'), ('a', 'b')]
def merge(x):
    s = set()
    for i in x:
        if not s.intersection(i):
            yield i
            s.update(i)
gives
>>> list(merge(lst))
[('a', 'b'), ('e', 'f')]
>>> list(merge([('a', 'b'), ('c', 'd'), ('c', 'e')]))
[('a', 'b'), ('c', 'd')]
>>> list(merge([('a', 'b'), ('a', 'c'), ('c', 'd')]))
[('a', 'b'), ('c', 'd')]
Sets should help (the display order of elements within each set may vary):
>>> s = list(map(set, lst))
>>> first = s[0]
>>> [first] + [i for i in s if not i & first]
[{'a', 'b'}, {'e', 'f'}]
Or with filterfalse (named ifilterfalse in Python 2):
>>> from itertools import filterfalse
>>> s = list(map(set, lst))
>>> [first] + list(filterfalse(first.intersection, s))
[{'a', 'b'}, {'e', 'f'}]
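The set-based snippets return sets, which loses the original tuples. If the tuples themselves should be kept, the same idea could be sketched as follows. The function name keep_disjoint_from_first is hypothetical. Like the set-based snippets, it compares only against the first tuple, unlike merge, which compares against everything kept so far.

```python
lst = [('a', 'b'), ('c', 'b'), ('a', 'd'), ('e', 'f'), ('a', 'b')]

def keep_disjoint_from_first(tuples):
    """Keep the first tuple plus every later tuple sharing no element with it."""
    if not tuples:
        return []
    first = set(tuples[0])
    # isdisjoint accepts any iterable, so the tuples need no conversion.
    return [tuples[0]] + [t for t in tuples[1:] if first.isdisjoint(t)]

print(keep_disjoint_from_first(lst))
# [('a', 'b'), ('e', 'f')]
print(keep_disjoint_from_first([('a', 'b'), ('c', 'd'), ('c', 'e')]))
# [('a', 'b'), ('c', 'd'), ('c', 'e')]
```

The second call shows the difference from merge, which would drop ('c', 'e') because it overlaps with ('c', 'd').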