Python array intersection efficiently

I don't know how to compute the intersection of these two arrays:
a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]
result = [[288,24], [193, 12]]
So the intersection is by the first element, and the second elements are summed. Any ideas how to do this efficiently?
OK, I made a mistake by not explaining what I mean by "efficient", sorry. Consider the following naive implementation:
a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]
result = {}
for i, j in a:
    for k, l in b:
        if i == k:
            result[i] = j + l
print result
So I was trying to find a more efficient, more Pythonic solution to my problem. That's why I needed your help.
Try these test cases (my code is also in them):
Test Case
Running time: 28.6980509758
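(The linked test case isn't reproduced here. As a rough sketch of how such a timing could be taken, with made-up data sizes rather than the original benchmark, one might wrap the naive implementation above in a function and time it:)
import random
import timeit

# Hypothetical inputs: [key, value] pairs with overlapping keys.
a = [[random.randrange(10000), random.randrange(100)] for _ in range(3000)]
b = [[random.randrange(10000), random.randrange(100)] for _ in range(3000)]

def naive_intersect(a, b):
    result = {}
    for i, j in a:
        for k, l in b:
            if i == k:
                result[i] = j + l
    return result

print timeit.timeit(lambda: naive_intersect(a, b), number=1)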

This data seems like it would be better stored as a dictionary:
da = dict(a)
db = dict(b)
once you have it like that you can just:
result = [[k, da[k] + db[k]] for k in set(da.keys()).intersection(db.keys())]
or in Python 2.7 you can also just use viewkeys instead of a set:
result = [[k, da[k] + db[k]] for k in da.viewkeys() & db]
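For the example lists above, either version gives (ordering within the result may vary, since it comes from a set):
>>> [[k, da[k] + db[k]] for k in set(da.keys()).intersection(db.keys())]
[[193, 12], [288, 24]]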

result = []
# Iterate over the smaller dict and look keys up in the bigger one.
ms, mb = (dict(a), dict(b)) if len(a) < len(b) else (dict(b), dict(a))
for k in ms:
    if k in mb:
        result.append([k, ms[k] + mb[k]])

Use a Counter:
from collections import Counter

c_sum = Counter()
c_len = Counter()
for elt in a:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1
for elt in b:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1
print [[k, c_sum[k]] for k, v in c_len.iteritems() if v > 1]

Here you go:
a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]
inter = []
for e in a:
    for e2 in b:
        if e[0] == e2[0]:
            inter.append([e[0], e[1] + e2[1]])
print inter
Outputs
[[193, 12], [288, 24]]

This solution also works if you want duplicate items within the lists to be counted.
from collections import defaultdict
a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]
d = defaultdict(int)
for value, num in a + b:
    d[value] += num
result = filter(lambda x: x[1] > 1, d.items())
result = map(list, result)  # if it's important that the result be a list of lists rather than a list of tuples
print result
# [[288, 24], [193, 12]]

In the first place, Python does not have arrays; it has lists. It is just a matter of naming, but it can be confusing. The one-liner for this is:
[[ae[0], ae[1] + be[1]] for be in b for ae in a if be[0] == ae[0]]
PS: As you say "intersection", I assume the lists are set-like ("bags", really), and that, as bags, they are properly normalized (i.e. they don't have repeated elements/keys).

Here is how I would approach it, assuming uniqueness on a and b:
k = {}    # store totals
its = {}  # store intersections
for i in a + b:
    if i[0] in k:
        its[i[0]] = True
        k[i[0]] += i[1]
    else:
        k[i[0]] = i[1]
# then loop through intersections for results
result = [[i, k[i]] for i in its]

I got:
from collections import defaultdict
d = defaultdict(list)
for series in a, b:
    for key, value in series:
        d[key].append(value)
result2 = [(key, sum(values)) for key, values in d.iteritems() if len(values) > 1]
which runs in O(len(a)+len(b)), or about 0.02 seconds on my laptop vs 18.79 for yours. I also confirmed that it returned the same results as result.items() from your algorithm.

This solution might not be the fastest, but it's probably the simplest implementation, so I decided to post it for completeness.
from collections import Counter

aa = Counter(dict(a))
bb = Counter(dict(b))
cc = aa + bb
cc
=> Counter({288: 24, 193: 12, 108: 1, 125: 1})
list(cc.items())
=> [(288, 24), (193, 12), (108, 1), (125, 1)]
If you must only include the common keys:
[(k, cc[k]) for k in set(aa).intersection(bb)]
=> [(288, 24), (193, 12)]

numpy's searchsorted(), argsort(), and intersect1d() are possible alternatives and can be quite fast. This example should also take care of the non-unique first element issue.
>>> import numpy as np
>>> a=np.array([[125, 1], [193, 1], [288, 23]])
>>> b=np.array([[108, 1], [288, 1], [193, 11]])
>>> aT=a.T
>>> bT=b.T
>>> aS=aT[0][np.argsort(aT[0])]
>>> bS=bT[0][np.argsort(bT[0])]
>>> i=np.intersect1d(aT[0], bT[0])
>>> cT=np.hstack((aT[:,np.searchsorted(aS, i)], bT[:,np.searchsorted(bS, i)]))
>>> [[item,np.sum(cT[1,np.argwhere(cT[0]==item).flatten()])] for item in i]
[[193, 12], [288, 24]]  # not quite happy about this; can someone come up with a vectorized way of doing it?
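(One possible vectorized variant, offered only as a sketch: sort the concatenated keys, sum each group with np.add.reduceat, and keep the keys that occur more than once.)
import numpy as np

a = np.array([[125, 1], [193, 1], [288, 23]])
b = np.array([[108, 1], [288, 1], [193, 11]])

keys = np.concatenate((a[:, 0], b[:, 0]))
vals = np.concatenate((a[:, 1], b[:, 1]))
order = np.argsort(keys, kind='mergesort')
k, v = keys[order], vals[order]
starts = np.concatenate(([0], np.flatnonzero(np.diff(k)) + 1))  # group boundaries
sums = np.add.reduceat(v, starts)                               # per-key sums
counts = np.diff(np.append(starts, len(k)))                     # per-key multiplicities
common = counts > 1
print np.column_stack((k[starts][common], sums[common]))
# [[193  12]
#  [288  24]]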

Related

More "pythonic" way to show a 4d matrix in 2d

I would like to plot a 4d matrix as a 2d matrix with indices:
[i][j][k][l] --> [i * nj + j][k * nl + l]
I have a working version here.
This works as I want, but it's not very elegant. I looked into reshape, but it is not exactly what I'm looking for, or perhaps I am using it incorrectly.
Given an array "r" of shape (100000, 4) holding 4d points, the relevant snippet I want to replace is:
import numpy as np

def transform(i, j, k, l, s1, s2):
    return [i * s1 + j, k * s2 + l]

nx = 5
ny = 11
iedges = np.linspace(0, 100, nx)
jedges = np.linspace(0, 20, ny)
bins = (iedges, jedges, iedges, jedges)
H, edges = np.histogramdd(r, bins=bins)
H2 = np.zeros(((nx-1)*(ny-1), (nx-1)*(ny-1)))
for i in range(nx-1):
    for j in range(ny-1):
        for k in range(nx-1):
            for l in range(ny-1):
                x, y = transform(i, j, k, l, ny-1, ny-1)
                H2[x][y] = H[i][j][k][l]
In this case the values of H2 will correspond to the values of H, but the entry i,j,k,l will display as i*ny + j, k * ny + l.
Example plot: (image not reproduced here)
Are you sure reshape doesn't work?
I ran your code on a small random r. The nonzero terms of H are:
In [13]: np.argwhere(H)
Out[13]:
array([[0, 9, 3, 1],
       [1, 1, 1, 2],
       [1, 2, 1, 3],
       [2, 2, 2, 3],
       [3, 1, 1, 8]])
and for the transformed H2:
In [14]: np.argwhere(H2)
Out[14]:
array([[ 9, 31],
       [11, 12],
       [12, 13],
       [22, 23],
       [31, 18]])
And one of the H indices transforms to H2 indices with:
In [16]: transform(0,9,3,1,4,10)
Out[16]: [9, 31]
If I simply reshape H, I get the same array as H2:
In [17]: H3=H.reshape(40,40)
In [18]: np.argwhere(H3)
Out[18]:
array([[ 9, 31],
       [11, 12],
       [12, 13],
       [22, 23],
       [31, 18]])
In [19]: np.allclose(H2,H3)
Out[19]: True
So without delving into the details of your code, it looks to me like a simple reshape.
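In general terms (with the question's nx and ny), that reshape would be:
H3 = H.reshape((nx-1)*(ny-1), (nx-1)*(ny-1))
which reproduces the (40, 40) shape used above.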
It looks like you can calculate i,j,k,l from x,y. This should be something like:
from functools import partial

def get_ijkl(x, y, s1, s2):
    # "Reverse" of `transform`
    i, j = divmod(x, s1)
    k, l = divmod(y, s2)
    return (i, j, k, l)

def get_2d_val(x, y, s1, s2, four_dim_array):
    return four_dim_array[get_ijkl(x, y, s1, s2)]

smaller_shape = ((nx-1)*(ny-1), (nx-1)*(ny-1))
Knowing this, there are several approaches possible:
numpy.fromfunction:
H3 = np.fromfunction(
    partial(get_2d_val, s1=ny-1, s2=ny-1, four_dim_array=H),
    shape=smaller_shape,
    dtype=int,
)
assert np.all(H2 == H3)
by indexing:
indices_to_take = np.array([
    [list(get_ijkl(x, y, ny-1, ny-1)) for x in range(smaller_shape[0])]
    for y in range(smaller_shape[1])
]).transpose()
H4 = H[tuple(indices_to_take)]
assert np.all(H2 == H4)
As answered by @hpaulj, you can simply reshape the array, and it will be faster. But if you have some different transform and can calculate an appropriate "reverse" function, then fromfunction or custom indexing becomes useful.

Finding consecutive duplicates and listing the indexes where they occur in Python

I have a list in Python, for example:
mylist = [1,1,1,1,1,1,1,1,1,1,1,
          0,0,1,1,1,1,0,0,0,0,0,
          1,1,1,1,1,1,1,1,0,0,0,0,0,0]
My goal is to find where there are five or more zeros in a row and then list the indexes where this happens; for example, the output for this would be:
[17, 21], [30, 35]
Here is what I have tried/seen in other questions asked on here:
def zero_runs(a):
    # Create an array that is 1 where a is 0, and pad each end with an extra 0.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Runs start and end where absdiff is 1.
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges

runs = zero_runs(mylist)
this gives output:
[0,10]
[11,12]
...
which is basically just listing the indexes of all runs. How would I go about separating this data into what I need?
You could use itertools.groupby; it will identify the contiguous groups in the list:
from itertools import groupby
lst = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
groups = [(k, sum(1 for _ in g)) for k, g in groupby(lst)]
cursor = 0
result = []
for k, l in groups:
    if not k and l >= 5:
        result.append([cursor, cursor + l - 1])
    cursor += l
print(result)
Output
[[17, 21], [30, 35]]
Your current attempt is very close. It returns all of the runs of consecutive zeros in an array, so all you need to do is add a check that filters out runs of fewer than 5 consecutive zeros.
def threshold_zero_runs(a, threshold):
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    m = (np.diff(ranges, 1) >= threshold).ravel()
    return ranges[m]

array([[17, 22],
       [30, 36]], dtype=int64)
Use the shift operator on the array. Compare the shifted version with the original. Where they do not match, you have a transition. You then need only to identify adjacent transitions that are at least 5 positions apart.
Can you take it from there?
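(A minimal sketch of that idea with numpy, assuming mylist from the question; the helper name find_runs is made up:)
import numpy as np

def find_runs(seq, min_len=5):
    x = np.asarray(seq)
    # Pad, then compare the shifted mask with the original:
    # wherever they differ, a run of zeros starts or ends.
    is_zero = np.concatenate(([False], x == 0, [False]))
    transitions = np.flatnonzero(is_zero[1:] != is_zero[:-1])
    starts, ends = transitions[::2], transitions[1::2] - 1  # inclusive ends
    keep = (ends - starts + 1) >= min_len
    return [[s, e] for s, e in zip(starts[keep], ends[keep])]

print(find_runs(mylist))  # -> [[17, 21], [30, 35]]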
Another way using itertools.groupby and enumerate.
First find the zeros and the indices:
from operator import itemgetter
from itertools import groupby
zerosList = [
    list(map(itemgetter(0), g))
    for i, g in groupby(enumerate(mylist), key=itemgetter(1))
    if not i
]
print(zerosList)
#[[11, 12], [17, 18, 19, 20, 21], [30, 31, 32, 33, 34, 35]]
Now just filter zerosList:
runs = [[x[0], x[-1]] for x in zerosList if len(x) >= 5]
print(runs)
#[[17, 21], [30, 35]]

Python Subtract Arrays Based on Same Time

Is there a way I can subtract two arrays while making sure I am subtracting elements that have the same day, hour, year, and/or minute values?
list1 = [[10, '2013-06-18'],[20, '2013-06-19'], [50, '2013-06-23'], [15, '2013-06-30']]
list2 = [[5, '2013-06-18'], [5, '2013-06-23'], [20, '2013-06-25'], [20, '2013-06-30']]
Looking for:
list1 - list2 = [[5, '2013-06-18'], [45, '2013-06-23'], [10, '2013-06-30']]
How about using a defaultdict of lists?
import itertools
from operator import sub
from collections import defaultdict
def subtract_lists(l1, l2):
    data = defaultdict(list)
    for sublist in itertools.chain(l1, l2):
        value, date = sublist
        data[date].append(value)
    return [(reduce(sub, v), k) for k, v in data.items() if len(v) > 1]
list1 = [[10, '2013-06-18'],[20, '2013-06-19'], [50, '2013-06-23'], [15, '2013-06-30']]
list2 = [[5, '2013-06-18'], [5, '2013-06-23'], [20, '2013-06-25'], [20, '2013-06-30']]
>>> subtract_lists(list1, list2)
[(-5, '2013-06-30'), (45, '2013-06-23'), (5, '2013-06-18')]
>>> # if you want them sorted by date...
>>> sorted(subtract_lists(list1, list2), key=lambda t: t[1])
[(5, '2013-06-18'), (45, '2013-06-23'), (-5, '2013-06-30')]
Note that the difference for date 2013-06-30 is -5, not +5.
This works by using the date as a dictionary key for a list of all values for that date. Then those keys having more than one value in their list are selected, and the values in those lists are reduced by subtraction. If you want the resulting list sorted, you can do so using sorted() with the date item as the key. You could move that operation into the subtract_lists() function if you always want that behavior.
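If you do want the sorting built in, a small wrapper would do (sketch):
def subtract_lists_sorted(l1, l2):
    # same as subtract_lists, but with the results ordered by date
    return sorted(subtract_lists(l1, l2), key=lambda t: t[1])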
I think this code does what you want:
list1 = [[10, '2013-06-18'],[20, '2013-06-19'], [50, '2013-06-23'], [15, '2013-06-30']]
list2 = [[5, '2013-06-18'], [5, '2013-06-23'], [20, '2013-06-25'], [20, '2013-06-30']]
list1 = dict([[i[1], i[0]] for i in list1])
list2 = dict([[i[1], i[0]] for i in list2])

def minus(a, b):
    return {k: a.get(k, 0) - b.get(k, 0) for k in set(a) & set(b)}

minus(list2, list1)
# returns the below, which is now a dictionary
{'2013-06-18': 5, '2013-06-23': 45, '2013-06-30': 5}
# you can convert it back into your format like this
data = [[value, key] for key, value in minus(list1, list2).iteritems()]
But you seem to have an error in your output data. If you want to include data when it's in either list, define minus like this instead:
def minus(a, b):
    return {k: a.get(k, 0) - b.get(k, 0) for k in set(a) | set(b)}
See this answer, on Merge and sum of two dictionaries, for more info.

Checking for and indexing non-unique/duplicate values in a numpy array

I have an array traced_descIDs containing object IDs, and I want to identify which items are not unique in this array. Then, for each duplicated ID, I need to identify which indices of traced_descIDs are associated with it.
As an example, if we take the traced_descIDs here, I want the following process to occur:
traced_descIDs = [1, 345, 23, 345, 90, 1]
dupIds = [1, 345]
dupInds = [[0,5],[1,3]]
I'm currently finding out which objects have more than 1 entry by:
mentions = np.array([len(np.argwhere(traced_descIDs == i)) for i in traced_descIDs])
dupMask = (mentions > 1)
however, this takes too long, as len(traced_descIDs) is around 150,000. Is there a faster way to achieve the same result?
Any help greatly appreciated. Cheers.
While dictionaries are O(n), the overhead of Python objects sometimes makes it more convenient to use numpy's functions, which use sorting and are O(n*log n). In your case, the starting point would be:
a = [1, 345, 23, 345, 90, 1]
unq, unq_idx, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
If you are using a version of numpy earlier than 1.9, then that last line would have to be:
unq, unq_idx = np.unique(a, return_inverse=True)
unq_cnt = np.bincount(unq_idx)
The contents of the three arrays we have created are:
>>> unq
array([ 1, 23, 90, 345])
>>> unq_idx
array([0, 3, 1, 3, 2, 0])
>>> unq_cnt
array([2, 1, 1, 2])
To get the repeated items:
cnt_mask = unq_cnt > 1
dup_ids = unq[cnt_mask]
>>> dup_ids
array([ 1, 345])
Getting the indices is a little more involved, but pretty straightforward:
cnt_idx, = np.nonzero(cnt_mask)
idx_mask = np.in1d(unq_idx, cnt_idx)
idx_idx, = np.nonzero(idx_mask)
srt_idx = np.argsort(unq_idx[idx_mask])
dup_idx = np.split(idx_idx[srt_idx], np.cumsum(unq_cnt[cnt_mask])[:-1])
>>> dup_idx
[array([0, 5]), array([1, 3])]
There is scipy.stats.itemfreq which would give the frequency of each item:
>>> xs = np.array([1, 345, 23, 345, 90, 1])
>>> ifreq = sp.stats.itemfreq(xs)
>>> ifreq
array([[  1,   2],
       [ 23,   1],
       [ 90,   1],
       [345,   2]])
>>> [(xs == w).nonzero()[0] for w in ifreq[ifreq[:,1] > 1, 0]]
[array([0, 5]), array([1, 3])]
Your current approach is O(N**2); use a dictionary to do it in O(N) time:
>>> from collections import defaultdict
>>> traced_descIDs = [1, 345, 23, 345, 90, 1]
>>> d = defaultdict(list)
>>> for i, x in enumerate(traced_descIDs):
... d[x].append(i)
...
>>> for k, v in d.items():
... if len(v) == 1:
... del d[k]
...
>>> d
defaultdict(<type 'list'>, {1: [0, 5], 345: [1, 3]})
And to get the items and indices:
>>> from itertools import izip
>>> dupIds, dupInds = izip(*d.iteritems())
>>> dupIds, dupInds
((1, 345), ([0, 5], [1, 3]))
Note that if you want to preserve the order of items in dupIds, then use collections.OrderedDict and the dict.setdefault() method.
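(A minimal sketch of that variant, using the same traced_descIDs:)
from collections import OrderedDict

d = OrderedDict()
for i, x in enumerate(traced_descIDs):
    d.setdefault(x, []).append(i)  # keys keep first-seen order

dup = OrderedDict((k, v) for k, v in d.items() if len(v) > 1)
print dup.items()
# [(1, [0, 5]), (345, [1, 3])]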
td = np.array(traced_descIDs)
si = np.argsort(td)
td[si][np.append(False, np.diff(td[si]) == 0)]
That gives you:
array([ 1, 345])
I haven't figured out the second part quite yet, but maybe this will be inspiration enough for you, or maybe I'll get back to it. :)
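(One way to finish that second part in the same spirit, offered as a sketch rather than the original author's code:)
import numpy as np

td = np.array([1, 345, 23, 345, 90, 1])
si = np.argsort(td, kind='mergesort')   # stable, keeps original order within ties
std = td[si]
# boundaries of equal-value groups in the sorted array
starts = np.concatenate(([0], np.flatnonzero(np.diff(std)) + 1))
counts = np.diff(np.append(starts, len(td)))
dup_ids = std[starts][counts > 1]
dup_inds = [si[s:s + c].tolist() for s, c in zip(starts[counts > 1], counts[counts > 1])]
print dup_ids   # [  1 345]
print dup_inds  # [[0, 5], [1, 3]]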
A solution of the same vectorized efficiency as proposed by Jaime is embedded in the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
print(npi.group_by(traced_descIDs, np.arange(len(traced_descIDs))))
This gets us most of the way there; but if we also want to filter out singleton groups while avoiding any python loops and staying entirely vectorized, we can go a little lower level, and do:
g = npi.group_by(traced_descIDs)
unique = g.unique
idx = g.split_array_as_list(np.arange(len(traced_descIDs)))
duplicates = unique[g.count>1]
idx_duplicates = np.asarray(idx)[g.count>1]
print(duplicates, idx_duplicates)
np.unique for ndims
I had a similar problem with an ndarray in which I wanted to find which rows are duplicated.
x = np.arange(60).reshape(5,4,3)
x[1] = x[0]
0 and 1 should be duplicates in axis 0. I used np.unique with all the return options, then used Jaime's method to locate the duplicates.
_, i, _, c = np.unique(x, return_index=True, return_inverse=True, return_counts=True, axis=0)
x_dup = x[i[1 < c]]
I use return_inverse unnecessarily here, for clarity. Here is the result:
>>> print(x_dup)
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]
  [ 9 10 11]]]

Python sort array by another positions array

Assume I have two arrays, the first one containing int data, the second one containing positions:
a = [11, 22, 44, 55]
b = [0, 1, 10, 11]
i.e. I want a[i] to be moved to position b[i] for all i. If I haven't specified a position, then insert a -1, i.e.:
sorted_a = [11, 22, -1, -1, -1, -1, -1, -1, -1, -1, 44, 55]
            ^   ^                                   ^   ^
            0   1                                   10  11
Another example:
a = [int1, int2, int3]
b = [5, 3, 1]
sorted_a = [-1, int3, -1, int2, -1, int1]
Here's what I've tried:
def sort_array_by_second(a, b):
    sorted = []
    for e1 in a:
        sorted.appendAt(b[e1])
    return sorted
Which I've obviously messed up.
Something like this:
res = [-1] * (max(b) + 1)  # create a list of the required size filled with -1's
for i, v in zip(b, a):
    res[i] = v
The idea behind the algorithm:
Create the resulting list with a size capable of holding up to the largest index in b
Populate this list with -1
Iterate through b elements
Set each res[b[i]] to its proper value a[i]
This will leave the resulting list with -1 in every position other than the indexes contained in b, which will have their corresponding value of a.
I would use a custom key function as an argument to sort. This will sort the values according to the corresponding value in the other list:
to_be_sorted = ['int1', 'int2', 'int3', 'int4', 'int5']
sort_keys = [4, 5, 1, 2, 3]
sort_key_dict = dict(zip(to_be_sorted, sort_keys))
to_be_sorted.sort(key=lambda x: sort_key_dict[x])
This has the benefit of not counting on the values in sort_keys to be valid integer indexes, which is not a very stable thing to bank on.
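For instance, the sort keys need not be valid indexes at all:
>>> items = ['int1', 'int2', 'int3', 'int4', 'int5']
>>> keys = dict(zip(items, [400, 500, 10, 20, 30]))  # arbitrary, non-index keys
>>> sorted(items, key=lambda x: keys[x])
['int3', 'int4', 'int5', 'int1', 'int2']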
>>> a = ["int1", "int2", "int3", "int4", "int5"]
>>> b = [4, 5, 1, 2, 3]
>>> sorted(a, key=lambda x, it=iter(sorted(b)): b.index(next(it)))
['int4', 'int5', 'int1', 'int2', 'int3']
Paulo Bu's answer is the most Pythonic way. If you want to stick with a function like yours:
def sort_array_by_second(a, b):
    sorted = []
    for n in b:
        sorted.append(a[n - 1])  # treats the values in b as 1-based positions
    return sorted
will do the trick.
Sorts A by the values of B:
A = ['int1', 'int2', 'int3', 'int4', 'int5']
B = [4, 5, 1, 2, 3]
from operator import itemgetter
C = [a for a, b in sorted(zip(A, B), key=itemgetter(1))]
print C
Output
['int3', 'int4', 'int5', 'int1', 'int2']
a = [11, 22, 44, 55]  # values
b = [0, 1, 10, 11]    # indexes to sort by
sorted_a = [-1] * (max(b) + 1)
for index, value in zip(b, a):
    sorted_a[index] = value
print(sorted_a)
# -> [11, 22, -1, -1, -1, -1, -1, -1, -1, -1, 44, 55]
