find the "overlap" between 2 python lists


Given 2 lists:
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
I want to find the "overlap":
c = [3,4,5,5,6]
I'd also like to extract the "remainder": the parts of a and b that are not in c.
a_remainder = [5,]
b_remainder = [1,4,7,]
Note:
a has three 5's in it and b has two.
b has two 4's in it and a has one.
The resultant list c should have two 5's (limited by list b) and one 4 (limited by list a).
This gives me what I want, but I can't help thinking there's a much better way.
import copy
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
c = []
for elem in copy.deepcopy(a):
    if elem in b:
        a.pop(a.index(elem))
        c.append(b.pop(b.index(elem)))
# now a and b both contain the "remainders" and c contains the "overlap"
On another note, what is a more accurate name for what I'm asking for than "overlap" and "remainder"?

collections.Counter, available since Python 2.7, can be used to implement multisets that do exactly what you want.
import collections
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
a_multiset = collections.Counter(a)
b_multiset = collections.Counter(b)
# & keeps the minimum count of each element, - subtracts counts
overlap = list((a_multiset & b_multiset).elements())
a_remainder = list((a_multiset - b_multiset).elements())
b_remainder = list((b_multiset - a_multiset).elements())
print overlap, a_remainder, b_remainder
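For readers on Python 3, the same multiset idea can be wrapped in a small helper. This is only a sketch: the name split_overlap is made up for illustration, and the results come back in whatever order Counter.elements() yields, so they are sorted before printing.

```python
from collections import Counter

def split_overlap(a, b):
    """Return (overlap, a_remainder, b_remainder), treating both lists as multisets."""
    a_count, b_count = Counter(a), Counter(b)
    overlap = list((a_count & b_count).elements())      # minimum of the two counts
    a_remainder = list((a_count - b_count).elements())  # occurrences left only in a
    b_remainder = list((b_count - a_count).elements())  # occurrences left only in b
    return overlap, a_remainder, b_remainder

overlap, a_rest, b_rest = split_overlap([3, 4, 5, 5, 5, 6], [1, 3, 4, 4, 5, 5, 6, 7])
print(sorted(overlap), sorted(a_rest), sorted(b_rest))
```

Unlike the loop in the question, this leaves the input lists untouched.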

Use Python sets (note that sets discard duplicates, so this only gives the distinct common elements):
intersection = set(a) & set(b)
a_remainder = set(a) - set(b)
b_remainder = set(b) - set(a)

In the language of sets, overlap is 'intersection' and remainder is 'set difference'. If you had distinct items, you wouldn't have to do these operations yourself; check out http://docs.python.org/library/sets.html if you're interested.
Since we're not working with distinct elements, your approach is reasonable. If you wanted this to run faster, you could create a dictionary that maps each number to how many times it occurs in a list (e.g., for a: 3->1, 4->1, 5->3, etc.). You would then iterate through a, check whether each number still has a positive count in b's map, decrement that count and add the number to the new list.
Untested code, but this is the idea
def add_or_update(counts, value):
    if value in counts:
        counts[value] += 1
    else:
        counts[value] = 1

# count how many times each element occurs in b
b_dict = dict()
for b_elem in b:
    add_or_update(b_dict, b_elem)

intersect = []
a_diff = []
for a_elem in a:
    if a_elem in b_dict and b_dict[a_elem] > 0:
        b_dict[a_elem] -= 1          # use up one occurrence from b
        intersect.append(a_elem)
    else:
        a_diff.append(a_elem)

# whatever counts are left in b_dict form b's remainder
b_diff = []
for k, v in b_dict.items():
    for i in range(v):
        b_diff.append(k)

OK, verbose, but kind of cool (similar in spirit to the collections.Counter idea, but more home-made):
import itertools as it
flatten = it.chain.from_iterable
sorted(
v for u,v in
set(flatten(enumerate(g)
for k, g in it.groupby(a))).intersection(
set(flatten(enumerate(g)
for k, g in it.groupby(b))))
)
The basic idea is to turn each list into a collection of (counter, value) tuples, numbered to account for duplicates, so that you can use set operations on these tuples after all. Note that it.groupby only groups adjacent equal elements, so this assumes both lists are sorted, as they are here.
To be slightly less verbose:
aa = set(flatten(enumerate(g) for k, g in it.groupby(a)))
bb = set(flatten(enumerate(g) for k, g in it.groupby(b)))
# aa = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5), (2, 5)])
# bb = set([(0, 1), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (1, 4), (1, 5)])
cc = aa.intersection(bb)
# cc = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5)])
c = sorted(v for u,v in cc)
# c = [3, 4, 5, 5, 6]
groupby -- produces a list of lists containing identical elements
(but because of the syntax needs the g for k,g in it.groupby(a) to extract each list)
enumerate -- appends a counter to each element of each sublist
flatten -- create a single list
set -- convert to a set
intersection -- find the common elements
sorted(v for u,v in cc) -- get rid of the counters and sort the result
Finally, I'm not sure what you mean by the remainders; it seems like it ought to be my aa-cc and bb-cc but I don't know where you get a_remainder = [4]:
sorted(v for u,v in aa-cc)
# [5]
sorted(v for u,v in bb-cc)
# [1, 4, 7]
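The same "attach a counter to each duplicate" idea can also be sketched without groupby, which removes the need for sorted input. The helper name tag_occurrences is hypothetical, and the code is written for Python 3.

```python
from collections import defaultdict

def tag_occurrences(items):
    """Pair each value with its repeat index, e.g. [5, 5, 4] -> {(5, 0), (5, 1), (4, 0)}."""
    seen = defaultdict(int)
    tagged = set()
    for item in items:
        tagged.add((item, seen[item]))
        seen[item] += 1
    return tagged

a = [3, 4, 5, 5, 5, 6]
b = [1, 3, 4, 4, 5, 5, 6, 7]
aa, bb = tag_occurrences(a), tag_occurrences(b)

# Ordinary set operations now respect duplicates; strip the counters afterwards.
c = sorted(value for value, _ in aa & bb)
a_remainder = sorted(value for value, _ in aa - bb)
b_remainder = sorted(value for value, _ in bb - aa)
print(c, a_remainder, b_remainder)
```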

A response from kerio in #python on freenode:
[ i for i in itertools.chain.from_iterable([k] * v for k, v in \
(Counter(a) & Counter(b)).iteritems())
]

Try difflib.SequenceMatcher(), "a flexible class for comparing pairs of sequences of any type"...
A quick try:
import difflib
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
sm = difflib.SequenceMatcher(None, a, b)
c = []
a_remainder = []
b_remainder = []
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == 'replace':
        a_remainder.extend(a[i1:i2])
        b_remainder.extend(b[j1:j2])
    elif tag == 'delete':
        a_remainder.extend(a[i1:i2])
    elif tag == 'insert':
        b_remainder.extend(b[j1:j2])
    elif tag == 'equal':
        c.extend(a[i1:i2])
And now...
>>> print c
[3, 4, 5, 5, 6]
>>> print a_remainder
[5]
>>> print b_remainder
[1, 4, 7]

a_set = set(a)
b_set = set(b)
a_remainder = a_set.difference(b_set)
b_remainder = b_set.difference(a_set)
c = a_set.intersection(b_set)
But if you need c to keep duplicates, and order is important to you,
you may want to look at the longest common subsequence problem.
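For the order-preserving case, a textbook dynamic-programming solution to the longest common subsequence problem might look like the sketch below. It is written for Python 3, the function name is chosen here for illustration, and it returns just one longest subsequence when several exist.

```python
def longest_common_subsequence(a, b):
    """Classic O(len(a) * len(b)) dynamic programme; returns one longest common subsequence."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to rebuild the subsequence.
    result, i, j = [], 0, 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result

lcs = longest_common_subsequence([3, 4, 5, 5, 5, 6], [1, 3, 4, 4, 5, 5, 6, 7])
print(lcs)  # [3, 4, 5, 5, 6]
```

For sorted inputs like the ones in the question this gives the same overlap as the multiset approach; for unsorted inputs the two notions can differ.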

I don't think you should actually use this solution, but I took this opportunity to practice with lambda functions and here is what I came up with :)
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
dedup = lambda x: [set(x)] if len(set(x)) == len(x) else [set(x)] + dedup([x[i] for i in range(1, len(x)) if x[i] == x[i-1]])
default_set = lambda x: (set() if x[0] is None else x[0], set() if x[1] is None else x[1])
deduped = map(default_set, map(None, dedup(a), dedup(b)))
get_result = lambda f: reduce(lambda x, y: list(x) + list(y), map(lambda x: f(x[0], x[1]), deduped))
c = get_result(lambda x, y: x.intersection(y)) # [3, 4, 5, 6, 5]
a_remainder = get_result(lambda x, y: x.difference(y)) # [5]
b_remainder = get_result(lambda x, y: y.difference(x)) # [1, 7, 4]
I'm pretty sure izip_longest would have simplified this a bit (wouldn't have needed the default_set lambda), but I was testing this with Python 2.5.
Here are some of the intermediate values used in the calculation in case anyone wants to understand this:
dedup(a) = [set([3, 4, 5, 6]), set([5]), set([5])]
dedup(b) = [set([1, 3, 4, 5, 6, 7]), set([4, 5])]
deduped = [(set([3, 4, 5, 6]), set([1, 3, 4, 5, 6, 7])), (set([5]), set([4, 5])), (set([5]), set([]))]

Related

Don't understand Python Expression

I have some basic knowledge of Python, but I have no idea what's going on in the code below. Can someone explain it or 'translate' it into a more common form?
steps = len(t)
sa = [i for i in range(steps)]
sa.sort(key = lambda i: t[i:i + steps])#I know that sa is a list
for i in range(len(sa)):
    sf = t[sa[i] : sa[i] + steps]
't' is actually a string
Thank you.

What I don't understand is the code: sa.sort(key = lambda i: t[i:i + steps])
sa.sort(key = lambda i: t[i:i + steps])
It sorts sa according to the natural ordering of the substrings t[i:i+len(t)]. Since i + steps is always greater than or equal to steps (which is len(t)), the slice always runs to the end of the string, so it could be written t[i:] instead (which makes the code simpler to understand).
You will understand it better using the decorate/sort/undecorate pattern:
>>> t = "azerty"
>>> sa = range(len(t))
>>> print sa
[0, 1, 2, 3, 4, 5]
>>> decorated = [(t[i:], i) for i in sa]
>>> print decorated
[('azerty', 0), ('zerty', 1), ('erty', 2), ('rty', 3), ('ty', 4), ('y', 5)]
>>> decorated.sort()
>>> print decorated
[('azerty', 0), ('erty', 2), ('rty', 3), ('ty', 4), ('y', 5), ('zerty', 1)]
>>> sa = [i for (_dummy, i) in decorated]
>>> print sa
[0, 2, 3, 4, 5, 1]
and sf = t[sa[i] : sa[i] + steps]
This could also be written more simply:
for i in range(len(sa)):
    sf = t[sa[i] : sa[i] + steps]
=>
for x in sa:
    sf = t[x:]
    print sf
which yields:
azerty
erty
rty
ty
y
zerty
You'll notice that this is exactly the keys used (and then discarded)
in the decorate/sort/undecorate example above, so the whole thing could be rewritten as:
def foo(t):
    decorated = sorted((t[i:], i) for i in range(len(t)))
    for sf, index in decorated:
        print sf
        # do something with sf here
As to what all this is supposed to do, I'm totally at a loss, but at least you now have a much more pythonic (readable...) version of this code ;)
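As a Python 3 sketch of the same idea, the key function can stay and the decoration can be dropped. The name sorted_suffixes is made up here. The resulting index list is what is usually called a suffix array, though whether that was the original author's intent is an assumption.

```python
def sorted_suffixes(t):
    """Return (start_index, suffix) pairs of t, ordered by suffix."""
    order = sorted(range(len(t)), key=lambda i: t[i:])
    return [(i, t[i:]) for i in order]

for index, suffix in sorted_suffixes("azerty"):
    print(index, suffix)
```

For "azerty" the indices come out as 0, 2, 3, 4, 5, 1, matching the decorate/sort/undecorate session above.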

The lambda in sort defines the criteria according to which the list is going to be sorted.
In other words, the list will not be sorted simply according to its values, but according to the function applied to the values.
Have a look here for more details.
It looks like what you are doing is sorting the list according to the alphabetical ordering of the substrings of the input string t.
Here is what is happening:
t = 'hello' # EXAMPLE
steps = len(t)
sa = [i for i in range(steps)]
sort_func = lambda i: t[i:i + steps]
for el in sa:
    print sort_func(el)
#hello
#ello
#llo
#lo
#o
So these are the values that determine the sorting of the list.
transf_list = [sort_func(el) for el in sa]
sorted(transf_list)
# ['ello', 'hello', 'llo', 'lo', 'o']
Hence:
sa.sort(key = sort_func)#I know that sa is a list
# [1, 0, 2, 3, 4]

compare two lists and return the differing indices and elements in python

I want to compare two lists and return the indices and elements that differ.
So I wrote the following code:
l1 = [1,1,1,1,1]
l2 = [1,2,1,1,3]
ind = []
diff = []
for i in range(len(l1)):
    if l1[i] != l2[i]:
        ind.append(i)
        diff.append([l1[i], l2[i]])
print ind
print diff
# output:
# [1, 4]
# [[1, 2], [1, 3]]
The code works, but is there a better way to do this?
Update to the question:
I'm asking for other solutions, for example using an iterator, or a ternary expression like [a,b](expression). (I want to exclude the straightforward approach I used above.) Thanks very much for your patience! :)

You could use a list comprehension to output all the information in a single list.
>>> [[idx, (i,j)] for idx, (i,j) in enumerate(zip(l1, l2)) if i != j]
[[1, (1, 2)], [4, (1, 3)]]
This will produce a list where each element is: [index, (first value, second value)] so all the information regarding a single difference is together.
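The comprehension can be wrapped in a small function that runs unchanged on Python 3. This is a sketch with a made-up name, differences. It returns flat (index, left, right) tuples and, because zip stops at the shorter input, ignores any extra trailing elements.

```python
def differences(l1, l2):
    """Return (index, left, right) for every position where the two lists differ."""
    return [(idx, x, y) for idx, (x, y) in enumerate(zip(l1, l2)) if x != y]

result = differences([1, 1, 1, 1, 1], [1, 2, 1, 1, 3])
print(result)  # [(1, 1, 2), (4, 1, 3)]
```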

An alternative way is the following
>>> l1 = [1,1,1,1,1]
>>> l2 = [1,2,1,1,3]
>>> z = zip(l1,l2)
>>> ind = [i for i, x in enumerate(z) if x[0] != x[1]]
>>> ind
[1, 4]
>>> diff = [z[i] for i in ind]
>>> diff
[(1, 2), (1, 3)]
In Python 3 you have to wrap zip in a call to list, because z is indexed afterwards.

You can try functional style:
res = filter(lambda (idx, x): x[0] != x[1], enumerate(zip(l1, l2)))
# [(1, (1, 2)), (4, (1, 3))]
to unzip res you can use:
zip(*res)
# [(1, 4), ((1, 2), (1, 3))]
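The lambda (idx, x) form relies on tuple parameter unpacking, which Python 3 removed. A possible Python 3 equivalent, shown only as a sketch, indexes into the pair instead and materializes the lazy filter result with list.

```python
l1 = [1, 1, 1, 1, 1]
l2 = [1, 2, 1, 1, 3]

# Each item is (index, (left, right)); keep those where left != right.
res = list(filter(lambda pair: pair[1][0] != pair[1][1], enumerate(zip(l1, l2))))
print(res)  # [(1, (1, 2)), (4, (1, 3))]

# "Unzip" into indices and differing pairs; the fallback covers the no-difference case.
ind, diff = zip(*res) if res else ((), ())
print(ind, diff)
```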

sum of products of couples in a list

I want to find the sum of the products of all couples (pairs) in a list.
For example, given the list [1, 2, 3, 4], the answer I want is 1*2 + 1*3 + 1*4 + 2*3 + 2*4 + 3*4.
I do it using brute force, which gives me a time-out error for very large lists.
I want an efficient way to do this. How can I do it?
Here is my code. It works, but I need a more efficient version:
def proSum(list):
    count = 0
    for i in range(len(list)- 1):
        for j in range(i + 1, len(list)):
            count += list[i] * list[j]
    return count

Here it is:
In [1]: def prodsum(xs):
   ...:     return (sum(xs)**2 - sum(x*x for x in xs)) / 2
   ...:
In [2]: prodsum([1, 2, 3, 4]) == 1*2 + 1*3 + 1*4 + 2*3 + 2*4 + 3*4
Out[2]: True
Let xs = a1, a2, ..., an. Then
    (a1 + a2 + ... + an)^2 = 2(a1*a2 + a1*a3 + ... + a(n-1)*an) + (a1^2 + ... + an^2)
So we have
    a1*a2 + ... + a(n-1)*an = ((a1 + a2 + ... + an)^2 - (a1^2 + ... + an^2)) / 2
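As a quick check of the identity, here is a Python 3 sketch (function names are illustrative) that compares the closed form against a direct double loop. It uses // because in Python 3 a single / would return a float. For [1, 2, 3, 4] the sum is 10 and the sum of squares is 30, so the result is (100 - 30) / 2 = 35.

```python
def prod_sum(xs):
    """Sum of x*y over all unordered pairs, via ((sum)^2 - sum of squares) / 2."""
    total = sum(xs)
    squares = sum(x * x for x in xs)
    # For integers the difference is twice the pair sum, so // is exact.
    return (total * total - squares) // 2

def prod_sum_brute(xs):
    """Direct O(n^2) double loop, used only to cross-check the formula."""
    return sum(xs[i] * xs[j] for i in range(len(xs)) for j in range(i + 1, len(xs)))

xs = [1, 2, 3, 4]
print(prod_sum(xs), prod_sum_brute(xs))  # 35 35
```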
Comparing the performance of @georg's method and mine.
The results and the test code are as follows (less time is better):
In [1]: import timeit
In [2]: import matplotlib.pyplot as plt
In [3]: def eastsunMethod(xs):
   ...:     return (sum(xs)**2 - sum(x*x for x in xs)) / 2
   ...:
In [4]: def georgMethod(given):
   ...:     sum = 0
   ...:     res = 0
   ...:     cur = len(given) - 1
   ...:
   ...:     while cur >= 0:
   ...:         res += given[cur] * sum
   ...:         sum += given[cur]
   ...:         cur -= 1
   ...:     return res
   ...:
In [5]: sizes = range(24)
In [6]: esTimes, ggTimes = [], []
In [7]: for s in sizes:
   ...:     t1 = timeit.Timer('eastsunMethod(xs)', 'from __main__ import eastsunMethod;xs=range(2**%d)' % s)
   ...:     t2 = timeit.Timer('georgMethod(xs)', 'from __main__ import georgMethod;xs=range(2**%d)' % s)
   ...:     esTimes.append(t1.timeit(8))
   ...:     ggTimes.append(t2.timeit(8))
In [8]: fig, ax = plt.subplots(figsize=(18, 6));lines = ax.plot(sizes, esTimes, 'r', sizes, ggTimes);ax.legend(lines, ['Eastsun', 'georg'], loc='center');ax.set_xlabel('size');ax.set_ylabel('time');ax.set_xlim([0, 23])
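A lighter-weight version of this comparison, without matplotlib, might look like the Python 3 sketch below. The snake_case function names and the three sizes are choices made here, not part of the original benchmark, and the timings will vary by machine, so none are shown.

```python
import timeit

def eastsun_method(xs):
    return (sum(xs) ** 2 - sum(x * x for x in xs)) // 2

def georg_method(given):
    running_sum = 0
    res = 0
    # Walk from the last element to the first, multiplying by the sum seen so far.
    for value in reversed(given):
        res += value * running_sum
        running_sum += value
    return res

for exponent in (8, 12, 16):
    xs = list(range(2 ** exponent))
    assert eastsun_method(xs) == georg_method(xs)  # both must agree before timing
    t1 = timeit.timeit(lambda: eastsun_method(xs), number=8)
    t2 = timeit.timeit(lambda: georg_method(xs), number=8)
    print(f"size 2**{exponent}: eastsun {t1:.4f}s, georg {t2:.4f}s")
```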

Use itertools.combinations to generate unique pairs:
# gives [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
unique_pairs = list(itertools.combinations([1, 2, 3, 4], 2))
Then use a list comprehension to get the product of each pair:
products = [x*y for x, y in unique_pairs] # => [2, 3, 4, 6, 8, 12]
Then use sum to add up your products:
answer = sum(products) # => 35
This can be all wrapped up in a one-liner like so:
answer = sum(x*y for x,y in itertools.combinations([1, 2, 3, 4], 2))
In making it a one-liner, the result of combinations is used without converting it to a list. Also, the brackets around the list comprehension are dropped, turning it into a generator expression.
Note: Eastsun's answer and georg's answer use much better algorithms and will easily outperform my answer for large lists.

Note: actually @Eastsun's answer is better.
Here's another, more "algorithmical" way to deal with that. Observe that given
a0, a1, ..., an
the desired sum is (due to the distributive law)
a0 (a1 + a2 + ... + an)
+ a1 (a2 + a3 + ... + an)
+ ...
+ an-2 (an-1 + an)
+ an-1 an
which leads to the following algorithm:
let sum be 0 and current be the last element
on each step
    multiply sum and current and add to the result
    add current to sum
    let current be the previous of current
In python:
sum = 0
res = 0
cur = len(given) - 1
while cur >= 0:
    res += given[cur] * sum
    sum += given[cur]
    cur -= 1
print res

def sumOfProductsOfCouples(l):
    return sum(l[i-1] * l[i] for i, n in enumerate(l))

With no external library, you can use map and lambda to calculate * pairwise, and then sum everything up
l=[1, 2, 3, 4]
sum(map(lambda x,y:x*y, l, l[1:]+[l[0]]))
But since you are dealing with big data, I suggest you use numpy.
import numpy as np
l = np.array([1, 2, 3, 4])
print sum(l*np.roll(l, 1))
# 24
EDIT: to keep up with the updated question of OP
import numpy as np
l = [1, 2, 3, 4]
sums = 0
while l:
    sums+=sum(l.pop(0)*np.array(l))
print sums
#35
So what it does is take the first element out of the list and multiply it by the rest, repeating until there is nothing left to take out of the list.

from itertools import combinations
l=[1, 2, 3, 4]
cnt=0
for x in combinations(l,2):
    cnt+=x[0]*x[1]
print (cnt)
Output:
>>>
35
>>>
combinations() will give you the pairs you want. Then do your calculation.
To debug it:
l=[1, 2, 3, 4]
for x in combinations(l,2):
    print (x)
>>>
(1, 2)
(1, 3)
(1, 4)
(2, 3)
(2, 4)
(3, 4)
>>>
See that your pairs are here. What you actually want is the sum of the products of these pairs.

Use the permutations method from the itertools module:
from itertools import *
p = permutations([1, 2, 3, 4], 2) # generate permutations of two
p = [frozenset(sorted(i)) for i in p] # sort items and cast
p = [list(i) for i in set(p)] # remove duplicates, back to lists
total = sum([i[0] * i[1] for i in p]) # 35 your final answer

You can use the map and sum functions.
>>> a = [1, 2, 3, 4]
>>> sum(map(sum, [map(lambda e: e*k, l) for k, l in zip(a, (a[start:] for start, _ in enumerate(a, start=1) if start < len(a)))]))
35
Dividing the above expression into parts,
>>> a = [1, 2, 3, 4]
>>> c = (a[start:] for start, _ in enumerate(a, start=1) if start < len(a))
>>> sum(map(sum, [map(lambda e: e*k, l) for k, l in zip(a, c)]))
35

how to find the max number of items in a list such that certain pairs are not together in the output?

I have a list of numbers
l = [1,2,3,4,5]
and a list of tuples which describe which items should not be in the output together.
gl_distribute = [(1, 2), (1,4), (1, 5), (2, 3), (3, 4)]
the possible lists are
[1,3]
[2,4,5]
[3,5]
and I want my algorithm to give me the second one [2,4,5]
I was thinking to do it recursively.
In the first case (t1) I call my recursive algorithm with all the items except the 1st, and in the second case (t2) I call it again removing the pairs from gl_distribute where the 1st item appears.
Here is my algorithm
def check_distribute(items, distribute):
    i = sorted(items[:])
    d = distribute[:]
    if not i:
        return []
    if not d:
        return i
    if len(remove_from_distribute(i, d)) == len(d):
        return i
    first = i[0]
    rest = items[1:]
    distr_without_first = remove_from_distribute([first], d)
    t1 = check_distribute(rest, d)
    t2 = check_distribute(rest, distr_without_first)
    t2.append(first)
    if len(t1) >= len(t2):
        return t1
    else:
        return t2
The remove_from_distribute(items, distr_list) removes the pairs from distr_list that include any of the items in items.
def remove_from_distribute(items, distribute_list):
    new_distr = distribute_list[:]
    for item in items:
        for pair in distribute_list:
            x, y = pair
            if x == item or y == item and pair in new_distr:
                new_distr.remove((x,y))
    if new_distr:
        return new_distr
    else:
        return []
My output is [4, 5, 3, 2, 1] which obviously is not correct. Can you tell me what I am doing wrong here? Or can you give me a better way to approach this?
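One way the include/exclude recursion described above could be made to work is sketched below (Python 3, with a hypothetical function name). The key difference from the code in the question is that when the first item is kept, the items it conflicts with are removed from the candidates, rather than the pairs being removed from the constraint list. This is exponential in the worst case and is meant only to illustrate the branching.

```python
def largest_compatible(items, forbidden):
    """Return a largest sublist of items that contains no forbidden pair."""
    if not items:
        return []
    first, rest = items[0], items[1:]

    # Option 1: leave `first` out and solve for the rest.
    without_first = largest_compatible(rest, forbidden)

    # Option 2: keep `first`, so everything paired with it must go.
    blocked = {y for x, y in forbidden if x == first} | {x for x, y in forbidden if y == first}
    allowed_rest = [item for item in rest if item not in blocked]
    with_first = [first] + largest_compatible(allowed_rest, forbidden)

    return with_first if len(with_first) > len(without_first) else without_first

l = [1, 2, 3, 4, 5]
gl_distribute = [(1, 2), (1, 4), (1, 5), (2, 3), (3, 4)]
best = largest_compatible(l, gl_distribute)
print(best)  # [2, 4, 5]
```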

I will suggest an alternative approach.
Assume your list and your distribution are sorted, your list has length n, and your distribution has length m.
First, create a list of 2-tuples with all valid combinations. This should be an O(n^2) solution.
Once you have that list, simply loop through the valid combinations and find the longest list. There are probably better solutions that further reduce the complexity.
Here is my sample code:
def get_valid():
    seq = [1, 2, 3, 4, 5]
    gl_dist = [(1, 2), (1,4), (1, 5), (2, 3), (3, 4)]
    gl_index = 0
    valid = []
    for i in xrange(len(seq)):
        for j in xrange(i+1, len(seq)):
            if gl_index < len(gl_dist):
                if (seq[i], seq[j]) != gl_dist[gl_index] :
                    valid.append((seq[i], seq[j]))
                else:
                    gl_index += 1
            else:
                valid.append((seq[i], seq[j]))
    return valid
>>> get_valid()
[(1, 3), (2, 4), (2, 5), (3, 5), (4, 5)]
def get_list():
    total = get_valid()
    start = total[0][0]
    result = [start]
    for i, j in total:
        if i == start:
            result.append(j)
        else:
            start = i
            return_result = list(result)
            result = [i, j]
            yield return_result
    yield list(result)
    raise StopIteration
>>> list(get_list())
[[1, 3], [2, 4, 5], [3, 5], [4, 5]]

I am not sure I fully understand your output, as I think 4,5 and 5,2 should be possible lists since they are not in the list of tuples.
If so, you could use itertools to generate the combinations, filter them against gl_distribute using sets to check whether any combination contains two elements that should not be together, and then take the max by length:
from itertools import combinations
combs = (combinations(l,r) for r in range(2,len(l)))
final = []
for x in combs:
    final += x
res = max(filter(lambda x: not any(len(set(x).intersection(s)) == 2 for s in gl_distribute),final),key=len)
print res
(2, 4, 5)
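A variation on this brute-force idea, sketched for Python 3 with a made-up function name, tries the largest subset sizes first and stops at the first conflict-free combination, so it never has to collect every combination in a list. It also covers full-length and single-item subsets, which range(2, len(l)) leaves out. It is still exponential in the worst case.

```python
from itertools import combinations

def largest_allowed(items, forbidden):
    """Try subset sizes from largest to smallest; return the first conflict-free one."""
    banned = {frozenset(pair) for pair in forbidden}
    for size in range(len(items), 0, -1):
        for combo in combinations(items, size):
            # A combination is valid if none of its pairs is banned.
            if not any(frozenset(p) in banned for p in combinations(combo, 2)):
                return list(combo)
    return []

l = [1, 2, 3, 4, 5]
gl_distribute = [(1, 2), (1, 4), (1, 5), (2, 3), (3, 4)]
answer = largest_allowed(l, gl_distribute)
print(answer)  # [2, 4, 5]
```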

Sorting in Sparse Matrix

I have a sparse matrix. I need to sort this matrix row-by-row and create another [sparse] matrix.
Code may explain it better:
# for `rand` function, you need newer version of scipy.
from scipy.sparse import *
m = rand(6,6, density=0.6)
d = m.getrow(0)
print d
Output1
(0, 5) 0.874881629788
(0, 4) 0.352559852239
(0, 2) 0.504791645463
(0, 1) 0.885898140175
I have this m matrix. I want to create a new matrix with a sorted version of m. The 0th row of the new matrix
looks like this.
new_d = new_m.getrow(0)
print new_d
Output2
(0, 1) 0.885898140175
(0, 5) 0.874881629788
(0, 2) 0.504791645463
(0, 4) 0.352559852239
So I can obtain which column is bigger etc:
print new_d.indices
Output3
array([1, 5, 2, 4])
Of course every row should be sorted like above independently.
I have one solution for this problem but it is not elegant.

If you're willing to ignore the zero-value elements of the matrix, the code below should work. It is also much faster than implementations that use the getrow method, which is rather slow.
from itertools import izip
def sort_coo(m):
    tuples = izip(m.row, m.col, m.data)
    return sorted(tuples, key=lambda x: (x[0], x[2]))
For example:
>>> from numpy.random import rand
>>> from scipy.sparse import coo_matrix
>>>
>>> d = rand(10, 20)
>>> d[d > .05] = 0
>>> s = coo_matrix(d)
>>> sort_coo(s)
[(0, 2, 0.004775589084940246),
(3, 12, 0.029941507166614145),
(5, 19, 0.015030386789436245),
(7, 0, 0.0075044957259399192),
(8, 3, 0.047994403933129481),
(8, 5, 0.049401058471327031),
(9, 15, 0.040011608000125043),
(9, 8, 0.048541825332137023)]
Depending on your needs you may want to tweak the sort keys in the lambda or further process the output. If you want everything in a row indexed dictionary you could do:
from collections import defaultdict
sorted_rows = defaultdict(list)
for i in sort_coo(m):
    sorted_rows[i[0]].append((i[1], i[2]))
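To get the largest-first column order shown in the question, the sort key can negate the value. The Python 3 sketch below uses plain hypothetical (row, col, value) triples standing in for zip(m.row, m.col, m.data), so it runs without scipy. The numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (row, col, value) triples, standing in for zip(m.row, m.col, m.data).
triples = [(0, 5, 0.87), (0, 4, 0.35), (0, 2, 0.50), (0, 1, 0.89), (1, 3, 0.10), (1, 0, 0.70)]

# Sort by row, then by value from largest to smallest within each row.
ordered = sorted(triples, key=lambda t: (t[0], -t[2]))

cols_by_row = defaultdict(list)
for row, col, _value in ordered:
    cols_by_row[row].append(col)

print(dict(cols_by_row))  # {0: [1, 5, 2, 4], 1: [0, 3]}
```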

My bad solution is like this:
from scipy.sparse import coo_matrix
import numpy as np
a = []
for i in xrange(m.shape[0]): # assume m is square matrix.
    d = m.getrow(i)
    n = len(d.indices)
    s = zip([i]*n, d.indices, d.data)
    sorted_s = sorted(s, key=lambda v: v[2], reverse=True)
    a.extend(sorted_s)
a = np.array(a)
new_m = coo_matrix((a[:,2], (a[:,0], a[:,1])), m.shape)
There may be some simple mistakes above because I have not tested it yet. But the idea is intuitive, I guess. Is there a better solution?
Edit
Creating this new matrix may be useless, because if you call the getrow method the order is broken again.
Only coo_matrix.col keeps the order.
Another Solution
This one is not an exact solution, but it may be helpful:
def sortSparseMatrix(m, rev=True, only_indices=True):
    """ Sort a sparse matrix and return column index dictionary
    """
    col_dict = dict()
    for i in xrange(m.shape[0]): # assume m is square matrix.
        d = m.getrow(i)
        s = zip(d.indices, d.data)
        # rev=True sorts each row from largest to smallest value
        sorted_s = sorted(s, key=lambda v: v[1], reverse=rev)
        if only_indices:
            col_dict[i] = [element[0] for element in sorted_s]
        else:
            col_dict[i] = sorted_s
    return col_dict
>>> print sortSparseMatrix(m)
{0: [5, 1, 0],
1: [1, 3, 5],
2: [1, 2, 3, 4],
3: [1, 5, 2, 4],
4: [0, 3, 5, 1],
5: [3, 4, 2]}
