SQL-style inner join in Python?

I have two arrays like this:
[('a', 'beta'), ('b', 'alpha'), ('c', 'beta'), .. ]
[('b', 37), ('c', 22), ('j', 93), .. ]
I want to produce something like:
[('b', 'alpha', 37), ('c', 'beta', 22), .. ]
Is there an easy way to do this?

I would suggest a hash-based, discriminator-join-like method:
l = [('a', 'beta'), ('b', 'alpha'), ('c', 'beta')]
r = [('b', 37), ('c', 22), ('j', 93)]

d = {}
for t in l:
    d.setdefault(t[0], ([], []))[0].append(t[1:])
for t in r:
    d.setdefault(t[0], ([], []))[1].append(t[1:])

from itertools import product
ans = [(k,) + a + b for k, v in d.items() for a, b in product(*v)]
results in:
[('c', 'beta', 22), ('b', 'alpha', 37)]
This has lower complexity, closer to O(n+m) than to O(nm), because it avoids building the full product of l and r and then filtering, as the naive method would.
Mostly from: Fritz Henglein's Relational algebra with discriminative joins and lazy products
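For comparison, a minimal sketch of the naive product-then-filter approach the complexity claim above refers to, using the same l and r:
from itertools import product
# O(n*m): build every pairing, then keep only those whose keys match
naive = [lt + rt[1:] for lt, rt in product(l, r) if lt[0] == rt[0]]
# [('b', 'alpha', 37), ('c', 'beta', 22)]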
It can also be written as:
def accumulate(it):
    d = {}
    for e in it:
        d.setdefault(e[0], []).append(e[1:])
    return d

l = accumulate([('a', 'beta'), ('b', 'alpha'), ('c', 'beta')])
r = accumulate([('b', 37), ('c', 22), ('j', 93)])

from itertools import product
ans = [(k,) + a + b for k in l.keys() & r.keys() for a, b in product(l[k], r[k])]
This accumulates each list separately (turning [(a, b, ...)] into {a: [(b, ...)]}) and then iterates over the intersection of their key sets. This looks cleaner. Note that the & operator works on dict key views (l.keys() & r.keys()), not on the dictionaries themselves; set(l) & set(r) would work just as well.

There is no built-in method for this. Adding a package like numpy would give you extra functionality, I assume.
But if you want to solve it without any extra packages, you can use a one-liner like this:
ar1 = [('a', 'beta'), ('b', 'alpha'), ('c', 'beta')]
ar2 = [('b', 37), ('c', 22), ('j', 93)]
final_ar = [tuple(list(i)+[j[1]]) for i in ar1 for j in ar2 if i[0]==j[0]]
print(final_ar)
Output:
[('b', 'alpha', 37), ('c', 'beta', 22)]
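If the inputs are large, the quadratic scan above can be avoided with a plain dict lookup; a minimal sketch of the same join (assuming, as in the example data, that each key appears at most once in ar2):
lookup = dict(ar2)                          # key -> value from the second list
final_ar = [i + (lookup[i[0]],) for i in ar1 if i[0] in lookup]
print(final_ar)                             # [('b', 'alpha', 37), ('c', 'beta', 22)]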

How to get the maximum amount of tuples that are not contained in another set yet?

Given a list of letters, say L=['a','b','c','d','e','f'] and a list of tuples, for example T=[('a','b'),('a','c'),('b','c')].
Now I want to create the maximum possible number of tuples from L that are not already contained in T. This needs to be done without duplicates, i.e. (a,b) would be the same as (b,a). Also, each letter can only be matched with one other letter.
My idea was:
# create a list of all possible tuples first:
all_tuples = [(x, y) for x in L for y in L if x != y]

# now remove duplicates
unique_tuples = list(set([tuple(sorted(elem)) for elem in all_tuples]))

# now, create a new set that matches each letter only once with another letter
visited = set()
output = []
for letter1, letter2 in unique_tuples:
    if (letter1, letter2) in T or (letter2, letter1) in T:
        continue
    if letter1 not in visited and letter2 not in visited:
        visited.add(letter1)
        visited.add(letter2)
        output.append((letter1, letter2))
print(output)
However, this does not always give the maximum possible number of tuples, depending on what T is. For example, suppose the remaining candidates are unique_tuples=[('a','b'),('a','d'),('b','c')].
If we append ('a','b') first to our output, we cannot append ('b','c') anymore, since 'b' was matched already. However, if we appended ('a','d') first, we could also get ('b','c') afterwards and get the maximum amount of two tuples.
How can one solve this?
If we ignore the business about not matching the same letter twice, this is a straightforward use of combinations:
>>> from itertools import combinations
>>> L=['a','b','c','d','e','f']
>>> T=[('a','b'),('a','c'),('b','c')]
>>> [t for t in combinations(L, 2) if t not in T]
[('a', 'd'), ('a', 'e'), ('a', 'f'), ('b', 'd'), ('b', 'e'), ('b', 'f'), ('c', 'd'), ('c', 'e'), ('c', 'f'), ('d', 'e'), ('d', 'f'), ('e', 'f')]
If we limit ourselves to using each letter only once, the problem is still straightforward, because we can have at most len(L) // 2 tuples. Just find the available letters (by subtracting those already present in T) and pair them up in any arbitrary order.
>>> used_letters = {c for t in T for c in t}
>>> free_letters = [c for c in L if c not in used_letters]
>>> [tuple(free_letters[i:i+2]) for i in range(0, 2 * (len(free_letters) // 2), 2)]
[('d', 'e')]
Without using libraries, you could do it like this:
L=['a','b','c','d','e','f']
T=[('a','b'),('a','c'),('b','c')]
L = sorted(L,key=lambda c: -sum(c in t for t in T))
used = set()
r = [(a, b) for i, a in enumerate(L) for b in L[i+1:]
     if (a, b) not in T and (b, a) not in T
     and used.isdisjoint((a, b)) and not used.update((a, b))]
print(r)
[('a', 'd'), ('b', 'e'), ('c', 'f')]
The letters are sorted in descending order of their frequency in T before combining them. This way the hardest-to-match letters are processed first, which maximizes the pairing potential for the remaining letters.
Alternatively, you could use a recursive approach that checks all possible pairing combinations.
def maxTuples(L, T):
    maxCombos = []                              # will return the longest combination
    for i, a in enumerate(L):                   # first letter of tuple
        for j, b in enumerate(L[i+1:], i+1):    # second letter of tuple
            if (a, b) in T: continue            # tuple must not be in T
            if (b, a) in T: continue            # inverted tuple must not be in T either
            rest = L[:i] + L[i+1:j] + L[j+1:]   # recurse with the rest of the letters
            R = [(a, b)] + maxTuples(rest, T)   # adding to the selected pair
            if len(R)*2 + 1 >= len(L): return R # max possible, stop here
            if len(R) > len(maxCombos):         # longer combination of tuples
                maxCombos = R                   # track it
    return maxCombos
...
L=['a','b','c','d','e','f']
T=[('a','b'),('a','c'),('b','c'),('c','f')]
print(maxTuples(L,T))
[('a', 'd'), ('b', 'f'), ('c', 'e')]
L = list("ABCDEFGHIJKLMNOP")
T = [('K', 'N'), ('G', 'F'), ('I', 'P'), ('C', 'A'), ('O', 'M'),
('D', 'B'), ('L', 'J'), ('E', 'H'), ('F', 'E'), ('L', 'H'),
('J', 'G'), ('N', 'I'), ('C', 'M'), ('A', 'P'), ('D', 'O'),
('K', 'B'), ('G', 'H'), ('O', 'A'), ('I', 'J'), ('N', 'M'),
('F', 'P'), ('E', 'B'), ('K', 'L'), ('D', 'C'), ('D', 'E'),
('L', 'F'), ('B', 'H'), ('I', 'A'), ('K', 'G'), ('M', 'O'),
('P', 'C'), ('N', 'J'), ('J', 'E'), ('N', 'P'), ('A', 'G'),
('H', 'O'), ('I', 'B'), ('K', 'F'), ('M', 'C'), ('L', 'D'),
('A', 'B'), ('C', 'E'), ('D', 'F'), ('G', 'I'), ('H', 'J'),
('K', 'M'), ('L', 'N'), ('O', 'P')]
print(maxTuples(L,T))
[('A', 'D'), ('B', 'C'), ('E', 'G'), ('F', 'H'),
('I', 'K'), ('J', 'M'), ('L', 'P'), ('N', 'O')]
Note that the function will be slow if the tuples in T exclude so many pairings that it is impossible to produce a combination of len(L)/2 tuples. It can be optimized further by filtering letters that are completely excluded as we go down the recursion:
def maxTuples(L, T):
    if not isinstance(T, dict):
        T, E = {c: {c} for c in L}, T                # convert T to a dictionary
        for a, b in E: T[a].add(b); T[b].add(a)      # of excluded-letter sets
    L = [c for c in L if not T[c].issuperset(L)]     # filter fully excluded letters
    maxCombos = []                                   # will return the longest combination
    for i, a in enumerate(L):                        # first letter of tuple
        for j, b in enumerate(L[i+1:], i+1):         # second letter of tuple
            if b in T[a]: continue                   # exclude tuples in T
            rest = L[:i] + L[i+1:j] + L[j+1:]        # recurse with the rest of the letters
            R = [(a, b)] + maxTuples(rest, T)        # adding to the selected pair
            if len(R)*2 + 1 >= len(L): return R      # max possible, stop here
            if len(R) > len(maxCombos):              # longer combination of tuples
                maxCombos = R                        # track it
    return maxCombos

Python calculate co-occurrence of tuples in list of lists of tuples

I have a big list of lists of tuples like
actions = [[('d', 'r'), ... ('c', 'e'), ('', 'e')],
           [('r', 'e'), ... ('c', 'e'), ('d', 'r')],
           ...,
           [('a', 'b'), ... ('c', 'e'), ('c', 'h')]]
and I want to find the co-occurrences of the tuples.
I have tried the suggestions from this question, but the accepted answer is just too slow. For example, on a list of 1494 lists of tuples, the resulting dictionary had 18225703 entries and took hours to run for 2-tuple co-occurrence. So plain permutation and counting doesn't seem to be the answer, since I have a bigger list.
I expect the output to extract the pairs (or groups of 3, 4, 5 at most) of tuples that co-occur most often. Using the previous list as an example:
('c', 'e'),('d', 'r')
would be a common co-occurrence when searching for pairs, since they appear together frequently. Is there an efficient method to achieve this?
I think there is no hope for a fundamentally faster algorithm: you have to compute the combinations in order to count them. However, if there is a threshold of co-occurrences below which you are not interested, you can try to reduce the complexity of the algorithm. In both cases, there is hope for lower space complexity.
Let's take a small example:
>>> actions = [[('d', 'r'), ('c', 'e'),('', 'e')],
... [('r', 'e'), ('c', 'e'),('d', 'r')],
... [('a', 'b'), ('c', 'e'),('c', 'h')]]
General answer
The approach from that question is probably the best for a large list of lists, but you can avoid creating intermediary lists. First, create an iterable over all co-occurring pairs of elements (the elements are pairs too in your case, but that doesn't matter):
>>> import itertools
>>> it = itertools.chain.from_iterable(itertools.combinations(pair_list, 2) for pair_list in actions)
If we want to see the result, we have to consume the iterable:
>>> list(it)
[(('d', 'r'), ('c', 'e')), (('d', 'r'), ('', 'e')), (('c', 'e'), ('', 'e')), (('r', 'e'), ('c', 'e')), (('r', 'e'), ('d', 'r')), (('c', 'e'), ('d', 'r')), (('a', 'b'), ('c', 'e')), (('a', 'b'), ('c', 'h')), (('c', 'e'), ('c', 'h'))]
Then count the sorted pairs (with a fresh it, since the previous one was consumed):
>>> it = itertools.chain.from_iterable(itertools.combinations(pair_list, 2) for pair_list in actions)
>>> from collections import Counter
>>> c = Counter((a,b) if a<=b else (b,a) for a,b in it)
>>> c
Counter({(('c', 'e'), ('d', 'r')): 2, (('', 'e'), ('d', 'r')): 1, (('', 'e'), ('c', 'e')): 1, (('c', 'e'), ('r', 'e')): 1, (('d', 'r'), ('r', 'e')): 1, (('a', 'b'), ('c', 'e')): 1, (('a', 'b'), ('c', 'h')): 1, (('c', 'e'), ('c', 'h')): 1})
>>> c.most_common(2)
[((('c', 'e'), ('d', 'r')), 2), ((('', 'e'), ('d', 'r')), 1)]
At least in terms of space, this solution should be efficient, since everything is lazy and the number of elements in the Counter is the number of combinations of elements that appear in the same list, which is at most N(N-1)/2, where N is the number of distinct elements across all the lists ("at most" because some elements never "meet" each other, and therefore some combinations never happen).
The time complexity is O(M·L²), where M is the number of lists and L is the size of the largest list.
With a threshold on the number of co-occurrences
I assume that all elements within a list are distinct. The key idea is that if an element is present in only one list, then it has strictly no chance of beating anyone at this game: it will have 1 co-occurrence with each of its neighbors, and 0 with the elements of the other lists. If there are a lot of "orphans", it might be useful to remove them before computing the combinations:
>>> d = Counter(itertools.chain.from_iterable(actions))
>>> d
Counter({('c', 'e'): 3, ('d', 'r'): 2, ('', 'e'): 1, ('r', 'e'): 1, ('a', 'b'): 1, ('c', 'h'): 1})
>>> orphans = set(e for e, c in d.items() if c <= 1)
>>> orphans
{('a', 'b'), ('r', 'e'), ('c', 'h'), ('', 'e')}
Now, try the same algorithm:
>>> it = itertools.chain.from_iterable(itertools.combinations((p for p in pair_list if p not in orphans), 2) for pair_list in actions)
>>> c = Counter((a,b) if a<=b else (b,a) for a,b in it)
>>> c
Counter({(('c', 'e'), ('d', 'r')): 2})
Note the inner generator expression: parentheses rather than brackets, so the filtering stays lazy.
If you have K orphans in a list of N elements, the work for that list falls from N(N-1)/2 to (N-K)(N-K-1)/2 combinations, that is K(2N-K-1)/2 fewer combinations.
This can be generalized: if an element is present in two or fewer lists, then it will have at most 2 co-occurrences with any other element, and so on.
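A minimal sketch of that generalization, assuming a hypothetical min_count threshold below which pairs are not of interest (an element appearing in fewer than min_count lists cannot reach that count with anyone, so it can be dropped up front):
min_count = 2                                   # hypothetical threshold
d = Counter(itertools.chain.from_iterable(actions))
rare = {e for e, c in d.items() if c < min_count}
it = itertools.chain.from_iterable(
    itertools.combinations([p for p in pair_list if p not in rare], 2)
    for pair_list in actions)
c = Counter((a, b) if a <= b else (b, a) for a, b in it)
frequent = {pair: n for pair, n in c.items() if n >= min_count}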
If this is still too slow, then switch to a faster language.

How to convert a Python multilevel dictionary into tuples?

I have a multi-level dictionary (example below) which needs to be converted into tuples in reverse order, i.e., the innermost elements should be used to create tuples first.
{a: {b:c, d:{e:f, g:h, i:{j:['a','b']}}}}
Output should be something like this:
[(j,['a','b']), (i,j), (g,h), (e,f), (d,e), (d,g), (d,i), (b,c), (a,b), (a,d)]
There you go, this will produce what you want (also tested):
def create_tuple(d):
    def create_tuple_rec(d, arr):
        for k in d:
            if type(d[k]) is not dict:
                arr.append((k, d[k]))
            else:
                for subk in d[k]:
                    arr.append((k, subk))
                create_tuple_rec(d[k], arr)
        return arr
    return create_tuple_rec(d, [])
# Running this
d = {'a': {'b':'c', 'd':{'e':'f', 'g':'h', 'i':{'j':['a','b']}}}}
print str(create_tuple(d))
# Will print:
[('a', 'b'), ('a', 'd'), ('b', 'c'), ('d', 'i'), ('d', 'e'), ('d', 'g'), ('i', 'j'), ('j', ['a', 'b']), ('e', 'f'), ('g', 'h')]

Python: union of set of tuples

Let's say we have two sets:
t = {('b', 3), ('a', 2)}
r = {('b', 4), ('c', 6)}
I want a union keyed on the first element, resulting in
u = {('b', 3), ('a', 2), ('c', 6)}
If the same symbol is present in both sets (for example 'b' above), then the element from the first set should be retained.
Thanks.
Just do:
t = {('b', 3), ('a', 2)}
r = {('b', 4), ('c', 6)}
d = dict(r)
d.update(t)
u = set(d.items())
print(u)
Output:
{('c', 6), ('a', 2), ('b', 3)}
A little bit shorter version:
s = dict((*r, *t))
set(s.items())
Output:
{('a', 2), ('b', 3), ('c', 6)}
for el in r:
    if not el[0] in [x[0] for x in t]:
        t.add(el)
t
{('a', 2), ('b', 3), ('c', 6)}
You can't do that with a plain set union. Two objects are either equal or they are not. Since your objects are tuples, ('b', 3) and ('b', 4) are not equal, and you don't get to change that for built-in tuples.
The obvious way would be to create your own class and redefine equality, something like
class MyTuple:
    def __init__(self, values):
        self.values = values
    def __eq__(self, other):
        return self.values[0] == other.values[0]
    def __hash__(self):
        # required so instances can live in a set; must be consistent with __eq__
        return hash(self.values[0])
and create sets of such objects.
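A minimal usage sketch (relying on CPython's behavior that adding an element "equal" to one already in the set keeps the existing element, so the first set's tuples win):
t = {('b', 3), ('a', 2)}
r = {('b', 4), ('c', 6)}
u = {MyTuple(x) for x in t}
u |= {MyTuple(x) for x in r}          # ('b', 4) compares equal to ('b', 3), so it is not added
print({m.values for m in u})          # {('b', 3), ('a', 2), ('c', 6)}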
An alternative using chain:
from itertools import chain
t = {('b', 3), ('a', 2)}
r = {('b', 4), ('c', 6)}
result = set({k: v for k, v in chain(r, t)}.items())
Output
{('b', 3), ('a', 2), ('c', 6)}
Here is my one-line style solution based on comprehensions:
t = {('b', 3), ('a', 2)}
r = {('b', 4), ('c', 6)}
result = {*t, *{i for i in r if i[0] not in {j[0] for j in t}}}
print(result) # {('b', 3), ('a', 2), ('c', 6)}
Using conversion to a dictionary to eliminate the duplicates, you can also do this, which is quite a neat solution IMHO:
t = {('b', 3), ('a', 2)}
r = {('b', 4), ('c', 6)}
result = {(k,v) for k,v in dict((*r,*t)).items()}
print(result) # {('b', 3), ('a', 2), ('c', 6)}

merging n sorted lists of tuples in python

I have n lists (n < 10) of tuples in the format [(ListID, [(index, value), (index, value), ...]), ...] and want to sort them by index to get the following outcome.
Example Input:
[('A',[(0.12, 'how'),(0.26,'are'),(0.7, 'you'),(0.9,'mike'),(1.9, "I'm fine too")]),
('B',[(1.23, 'fine'),(1.50, 'thanks'),(1.6,'and you')]),
('C',[(2.12,'good'),(2.24,'morning'),(3.13,'guys')])]
Desired Output:
[('A', ( 0.12, 'how')),
('A', ( 0.26, 'are')),
('A', ( 0.7, 'you')),
('A', ( 0.9, 'mike')),
('B',(1.23, 'fine')),
('B',(1.50, 'thanks')),
('B',(1.6,'and you')),
('A', (1.9, "I'm fine too")),
('C',(2.12,'good')),
('C',(2.24,'morning')),
('C',(3.13,'guys'))]
I know the code is ugly, especially those indices like item[1][-1][0], but can somebody tell me what I am doing wrong?
content = []
max = 0.0
first = True
Done = False
finished = []

while not Done:
    for item in flow:
        if len(finished) == 4:
            Done = True
            break
        if len(item[1]) == 0:
            if item[0] not in finished:
                finished.append(item[0])
            continue
        if first == True:
            max = item[1][-1][0]
            content.append((item[0], item[1].pop()))
            first = False
            continue
        if item[1][-1][0] > max:
            max = item[1][-1][0]
            content.append((item[0], item[1].pop()))
    content = sorted(content, key=itemgetter(1))
    first = True
UPDATE:
thank you everybody
>>> from operator import itemgetter
>>> import pprint
>>> pprint.pprint(sorted(((i,k) for i,j in INPUT for k in j), key=itemgetter(1)))
[('A', (0.12, 'how')),
('A', (0.26000000000000001, 'are')),
('A', (0.69999999999999996, 'you')),
('A', (0.90000000000000002, 'mike')),
('B', (1.23, 'fine')),
('B', (1.5, 'thanks')),
('B', (1.6000000000000001, 'and you')),
('A', (1.8999999999999999, "I'm fine")),
('C', (2.1200000000000001, 'good')),
('C', (2.2400000000000002, 'morning')),
('C', (3.1299999999999999, 'guys'))]
There are two main things going on here:
[(i,k) for i,j in INPUT for k in j]
transforms the INPUT into this structure
[('A', (0.12, 'how')),
('A', (0.26, 'are')),
('A', (0.7, 'you')),
('A', (0.9, 'mike')),
('A', (1.9, "I'm fine")),
('B', (1.23, 'fine')),
('B', (1.5, 'thanks')),
('B', (1.6, 'and you')),
('C', (2.12, 'good')),
('C', (2.24, 'morning')),
('C', (3.13, 'guys'))]
and
sorted(L, key=itemgetter(1))
sorts L by item[1] of each element. item[1] is actually the whole inner tuple, (0.12, 'how'), (0.26, 'are') ..., but Python compares tuples from left to right, so we don't need to do any extra work to strip the word out of the tuple.
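A quick illustration of that left-to-right tuple comparison (a minimal example):
>>> (1.23, 'fine') < (1.6, 'and you')   # 1.23 < 1.6 decides it; the strings are never compared
True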
(OK, the sample data makes the problem description much clearer. Answer revised accordingly)
Step 1: clarify your problem description by reverse engineering your current solution.
There are 4 different data sets labelled A, B, C and D
These data sets are contained in a series of 2-tuples of the form (ListID, elements)
Each "elements" entry is itself a list of 2-tuples of the form (index, value)
An empty elements entry indicates the end of a data set
The goal is to merge these data sets into a single sorted list of 2-tuples (ListID, (index, value))
Step 2: transform the input data to create individual records of the desired form.
Generators are built for this kind of thing, so it makes sense to define one.
def get_data(flow, num_data_sets=4):
    finished = set()
    for list_id, elements in flow:
        if list_id in finished:
            continue
        if not elements:
            finished.add(list_id)
            if len(finished) == num_data_sets:
                break
            continue
        for element in elements:
            yield list_id, element
Step 3: use sorted to produce the desired ordered list
content = sorted(get_data(flow))
Sample usage:
# get_data defined via copy/paste of source code above
# ref_data taken from the revised question
>>> demo_data = [
... ('A', [(1, 2), (3, 4)]),
... ('B', [(7, 8), (9, 10)]),
... ('A', [(0, 0)]),
... ('C', []), # Finish early
... ('C', [('ignored', 'entry')])
... ]
>>> content = sorted(get_data(demo_data))
>>> print '\n'.join(map(str, content))
('A', (0, 0))
('A', (1, 2))
('A', (3, 4))
('B', (7, 8))
('B', (9, 10))
>>> content = sorted(get_data(ref_data), key=itemgetter(1))
>>> print '\n'.join(map(str, content))
('A', (0.12, 'how'))
('A', (0.26, 'are'))
('A', (0.7, 'you'))
('A', (0.9, 'mike'))
('B', (1.23, 'fine'))
('B', (1.5, 'thanks'))
('B', (1.6, 'and you'))
('A', (1.9, "I'm fine too"))
('C', (2.12, 'good'))
('C', (2.24, 'morning'))
('C', (3.13, 'guys'))
Your solution ends up being messy and hard to read for two main reasons:
Failing to use a generator means you aren't gaining the full benefit of the built-in sorted function
By using indexing instead of tuple unpacking, you make it very hard to keep track of what is what
data = [(x, id) for (id, xs) in data for x in xs]
data.sort()
for xs, id in data:
    print id, xs
A (0.12, 'how')
A (0.26000000000000001, 'are')
A (0.69999999999999996, 'you')
A (0.90000000000000002, 'mike')
B (1.23, 'fine')
B (1.5, 'thanks')
B (1.6000000000000001, 'and you')
A (1.8999999999999999, "I'm fine too")
C (2.1200000000000001, 'good')
C (2.2400000000000002, 'morning')
C (3.1299999999999999, 'guys')
Your input:
l = [('A',
[(0.12, 'how'),
(0.26000000000000001, 'are'),
(0.69999999999999996, 'you'),
(0.90000000000000002, 'mike'),
(1.8999999999999999, "I'm fine too")]),
('B', [(1.23, 'fine'), (1.5, 'thanks'), (1.6000000000000001, 'and you')]),
('C',
[(2.1200000000000001, 'good'),
(2.2400000000000002, 'morning'),
(3.1299999999999999, 'guys')])]
Convert (and print):
newlist = []
for alpha, tuplelist in l:
    for tup in tuplelist:
        newlist.append((alpha, tup))

from operator import itemgetter
newlist = sorted(newlist, key=itemgetter(1))
print newlist
Check!
[('A', (0.12, 'how')),
('A', (0.26000000000000001, 'are')),
('A', (0.69999999999999996, 'you')),
('A', (0.90000000000000002, 'mike')),
('B', (1.23, 'fine')),
('B', (1.5, 'thanks')),
('B', (1.6000000000000001, 'and you')),
('A', (1.8999999999999999, "I'm fine too")),
('C', (2.1200000000000001, 'good')),
('C', (2.2400000000000002, 'morning')),
('C', (3.1299999999999999, 'guys'))]
You can of course do it all within a list comprehension, but you still use two for loops and one built-in sorted call. Might as well keep it verbose and readable.
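Since the inner lists are already sorted by index, a lazy merge with the standard library's heapq.merge is another option; a minimal sketch (Python 3.5+ for the key= argument), assuming the same input l as above:
from heapq import merge
from operator import itemgetter

# one (ListID, (index, value)) stream per input list; each stream stays sorted by index
streams = [[(list_id, tup) for tup in tuples] for list_id, tuples in l]
merged = list(merge(*streams, key=itemgetter(1)))   # merge by the (index, value) tuple
# [('A', (0.12, 'how')), ('A', (0.26, 'are')), ..., ('C', (3.13, 'guys'))]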
