How to remove non-symmetric pairs in a numpy array? - python

Given an Nx2 numpy array data of ints (we can assume that data has no duplicate rows), I need to keep only the rows i for which there is some row j satisfying the relationship
(data[i,0] == data[j,1]) & (data[i,1] == data[j,0])
For instance, with
import numpy as np
data = np.array([[1, 2],
                 [2, 1],
                 [7, 3],
                 [6, 6],
                 [5, 6]])
I should return
array([[1, 2],   # because [2, 1] is present
       [2, 1],   # because [1, 2] is present
       [6, 6]])  # because [6, 6] is present
One verbose way to do this is
def filter_symmetric_pairs(data):
    result = np.empty((0, 2), dtype=data.dtype)  # match the input dtype, otherwise the result comes back as float
    for i in range(len(data)):
        for j in range(len(data)):
            if (data[i, 0] == data[j, 1]) & (data[i, 1] == data[j, 0]):
                result = np.vstack([result, data[i, :]])
    return result
and I came up with this more concise version:
def filter_symmetric_pairs(data):
    return data[[row.tolist() in data[:, ::-1].tolist() for row in data]]
Can somebody suggest a better numpy idiom?

Here are a couple of different methods you may use to do that. The first one is the "obvious" quadratic solution, which is simple but may give you trouble if you have a big input array. The second one should work as long as you don't have a huge range of numbers in the input, and it has the advantage of working with a linear amount of memory.
import numpy as np

# Input data
data = np.array([[1, 2],
                 [2, 1],
                 [7, 3],
                 [6, 6],
                 [5, 6]])
# Method 1 (quadratic memory)
d0, d1 = data[:, 0, np.newaxis], data[:, 1]
# Compare all values in first column to all values in second column
c = d0 == d1
# Find where comparison matches both ways
c &= c.T
# Get matching elements
res = data[c.any(0)]
print(res)
# [[1 2]
#  [2 1]
#  [6 6]]
# Method 2 (linear memory)
# Convert pairs into single values
# (assumes non-negative values, otherwise shift first)
n = data.max() + 1
v = data[:, 0] + (n * data[:, 1])
# Symmetric values
v2 = (n * data[:, 0]) + data[:, 1]
# Find where symmetric is present
m = np.isin(v2, v)
res = data[m]
print(res)
# [[1 2]
#  [2 1]
#  [6 6]]
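A minimal sketch of the shift mentioned above, in case the data may contain negative values (not part of the original answer):
# Shift the values so the smallest becomes 0; the pair encoding then
# stays unique and non-negative, while res still returns original rows.
offset = data.min()
shifted = data - offset
n = shifted.max() + 1
v = shifted[:, 0] + (n * shifted[:, 1])
v2 = (n * shifted[:, 0]) + shifted[:, 1]
res = data[np.isin(v2, v)]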

You can sort the arrays, preserving the row contents, using argsort for both the original and reversed arrays, then check which rows are equal and use that as a mask to slice data. (Note that ties in the first column can break the alignment between the two sorts, so this is most reliable when the first-column values are distinct.)
import numpy as np

data = np.array([[1, 2],
                 [2, 1],
                 [7, 3],
                 [6, 6],
                 [5, 6]])

data_r = data[:, ::-1]
sorter = data.argsort(axis=0)[:, 0]
sorter_r = data_r.argsort(axis=0)[:, 0]
mask = (data.take(sorter, axis=0) == data_r.take(sorter_r, axis=0)).all(axis=1)
data.take(sorter, axis=0)[mask]  # apply the mask in the same sorted order it was computed in
# returns:
array([[1, 2],
       [2, 1],
       [6, 6]])

Another solution dawned on me, which sees data as the edge list of a directed graph and keeps only the bidirected edges (my problem is thus equivalent to detecting mutual edges in a graph):
def filter_symmetric_pairs(data):
    rank = data.max() + 1
    adj = np.zeros((rank, rank))
    adj[data[:, 0], data[:, 1]] = 1  # treat the rows as edges of a directed graph, build the adjacency matrix
    bidirected_edges = (adj == adj.T) & (adj == 1)  # impose symmetry and a nonzero value
    return np.vstack(np.nonzero(bidirected_edges)).T  # list the index pairs satisfying the constraint
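A quick usage check (note that the rows come back in adjacency-matrix order, which here happens to coincide with the input order):
data = np.array([[1, 2], [2, 1], [7, 3], [6, 6], [5, 6]])
print(filter_symmetric_pairs(data))
# [[1 2]
#  [2 1]
#  [6 6]]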

Related

Efficient way to find all the pairs in a list without using nested loop

Suppose I have a list that stores many 2D points. In this list, some positions store the same point; consider the indices of two positions that store the same point as an index pair. I want to find all such pairs in the list and return all 2-by-2 index pairs. It is possible that the list has some points repeated more than two times, but only the first match needs to be treated as a pair.
For example, in the list below, I have 9 points in total and there are 5 positions containing repeated points. The indices 0, 3, and 7 store the same point ([1, 1]), and the indices 1 and 6 store the same point ([2, 3]).
[[1, 1], [2, 3], [1, 4], [1, 1], [10, 3], [5, 2], [2, 3], [1, 1], [3, 4]]
So, for this list, I want to return the index pairs (index 0, index 3) and (index 1, index 6). The only solution I can come up with uses nested loops, which I coded up as follows:
import numpy as np
from numpy import linalg

A = np.array([[1, 1], [2, 3], [1, 4], [1, 1], [10, 3], [5, 2], [2, 3], [1, 1], [3, 4]], dtype=int)
# I don't want to modify the original list, so I loop through an index list instead.
Index = np.arange(0, A.shape[0], 1, dtype=int)
Pair = []  # to store the index pairs
while Index.size != 0:
    current_index = Index[0]
    pi = A[current_index]
    Index = np.delete(Index, 0, 0)
    for j in range(Index.shape[0]):
        pj = A[Index[j]]
        distance = linalg.norm(pi - pj, ord=2, keepdims=True)
        if distance == 0:
            Pair.append([current_index, Index[j]])
            Index = np.delete(Index, j, 0)
            break
While this code works for me, its time complexity is O(n^2), where n == len(A). I'm wondering whether there is a more efficient way to do this job with a lower time complexity. Thanks for any ideas and help.
You can use a dictionary to keep track of the indices for each point.
Then, you can iterate over the items in the dictionary, printing out the indices corresponding to points that appear more than once. The runtime of this procedure is linear, rather than quadratic, in the number of points in A:
points = {}
for index, point in enumerate(A):
    point_tuple = tuple(point)
    if point_tuple not in points:
        points[point_tuple] = []
    points[point_tuple].append(index)

for point, indices in points.items():
    if len(indices) > 1:
        print(indices)
This prints out:
[0, 3, 7]
[1, 6]
If you only want the first two indices where a point appears, you can use print(indices[:2]) rather than print(indices).
This is similar to the other answer, but since you only want the first two in the event of multiple pairs you can do it in a single iteration. Add the indices under the appropriate key in a dict and yield the indices if (and only if) there are two points:
from collections import defaultdict

l = [[1, 1], [2, 3], [1, 4], [1, 1], [10, 3], [5, 2], [2, 3], [1, 1], [3, 4]]

def get_pairs(l):
    ind = defaultdict(list)
    for i, pair in enumerate(l):
        t = tuple(pair)
        ind[t].append(i)
        if len(ind[t]) == 2:
            yield list(ind[t])

list(get_pairs(l))
# [[0, 3], [1, 6]]
One pure-NumPy solution without loops (the only one so far) is to use np.unique twice, with a trick that consists of removing the first items found between the two searches. This solution assumes a sentinel value can be set (e.g. -1, the minimum value of the integer type, or NaN), which is generally not a problem (you can use bigger types if needed).
A = np.array([[1, 1], [2, 3], [1, 4], [1, 1], [10, 3], [5, 2], [2, 3], [1, 1], [3, 4]], dtype=int)

# Copy the array so as not to mutate it
tmp = A.copy()

# Find the location of unique values
pair1, index1 = np.unique(tmp, return_index=True, axis=0)

# Discard the elements found, assuming the sentinel is never stored in A
INT_MIN = np.iinfo(A.dtype).min
tmp[index1] = INT_MIN

# Find the location of duplicated values
pair2, index2 = np.unique(tmp, return_index=True, axis=0)

# Extract the indices that share the same pair of values
left = index1[np.isin(pair1, pair2).all(axis=1)]
right = index2[np.isin(pair2, pair1).all(axis=1)]

# Combine each left index with its matching right index
result = np.hstack((left[:, None], right[:, None]))
# result = array([[0, 3],
#                 [1, 6]])
This solution should run in O(n log n) time, since np.unique sorts its input internally.

NumPy apply function to groups of rows corresponding to another numpy array

I have a NumPy array with each row representing some (x, y, z) coordinate like so:
a = array([[0, 0, 1],
           [1, 1, 2],
           [4, 5, 1],
           [4, 5, 2]])
I also have another NumPy array with unique values of the z-coordinates of that array like so:
b = array([1, 2])
How can I apply a function, let's call it "f", to each of the groups of rows in a which correspond to the values in b? For example, the first value of b is 1 so I would get all rows of a which have a 1 in the z-coordinate. Then, I apply a function to all those values.
In the end, the output would be an array the same shape as b.
I'm trying to vectorize this to make it as fast as possible. Thanks!
Example of an expected output (assuming that f is count()):
c = array([2, 2])
because there are 2 rows in array a which have a z value of 1 in array b and also 2 rows in array a which have a z value of 2 in array b.
A trivial solution would be to iterate over array b like so:
c = []
for val in b:
    # apply the function to the rows of a whose z-coordinate equals val
    c.append(f(a[a[:, 2] == val]))
c = np.array(c)
My attempt:
I tried doing something like this, but it just returns an empty array.
func(a[a[:, 2]==b])
The problem is that the groups of rows with the same z can have different sizes, so you cannot stack them into one 3D numpy array, which would allow you to easily apply a function along the third dimension. One solution is to use a for-loop; another is to use np.split:
a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])

a_sorted = a[a[:, 2].argsort()]
inds = np.unique(a_sorted[:, 2], return_index=True)[1]
a_split = np.split(a_sorted, inds)[1:]
# [array([[0, 0, 1],
#         [4, 5, 1],
#         [4, 3, 1]]),
#  array([[1, 1, 2],
#         [4, 5, 2]])]

f = np.sum  # example of a function
result = list(map(f, a_split))
# [19, 15]
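For the count() example in the question specifically, np.unique with return_counts gives the result directly, since the counts line up with the sorted unique z values in b (a sketch, not part of the original answer):
_, c = np.unique(a[:, 2], return_counts=True)
# for the 4-row a from the question this gives c == array([2, 2])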
But imho the best solution is to use pandas and groupby as suggested by FBruzzesi. You can then convert the result to a numpy array.
EDIT: For completeness, here are the other two solutions
List comprehension:
b = np.unique(a[:,2])
result = [f(a[a[:,2] == z]) for z in b]
Pandas:
df = pd.DataFrame(a, columns=list('XYZ'))
result = df.groupby(['Z']).apply(lambda x: f(x.values)).tolist()
I benchmarked these approaches with a = np.random.randint(0, 100, (n, 3)). Approximately up to n = 10^5 the "split solution" is the fastest, but after that the pandas solution performs better.
If you are allowed to use pandas:
import pandas as pd

df = pd.DataFrame(a, columns=['x', 'y', 'z'])
df.groupby('z').agg(f)
Here f can be any custom function working on grouped data.
Numeric example:
a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2]])
df = pd.DataFrame(a, columns=['x', 'y', 'z'])
df.groupby('z').size()
z
1    2
2    2
dtype: int64
Note that .size() is the way to count the number of rows per group.
To keep it in pure numpy, maybe this can suit your case:
tmp = np.array([a[a[:, 2] == i] for i in b])
tmp
array([[[0, 0, 1],
        [4, 5, 1]],
       [[1, 1, 2],
        [4, 5, 2]]])
which is an array with each group of rows. (Note that this only produces a regular 3D array when all groups have the same size; with unequal group sizes you get a ragged object array instead.)
c = np.array([])
for x in np.nditer(b):
    c = np.append(c, np.where(a[:, 2] == x)[0].shape[0])
Output:
[2. 2.]

Numpy - Try to split an array according to a monotonic condition

I have a numpy array that is a sequence of (x, y) coordinates. I'm trying to split it according to a monotonic condition. To exemplify this:
cords = np.array([[1,1],[2,3],[2,4],[2,5],[4,3],[4,5],[4,6],[4,7],[5,7],[5,5]])
I would like to split the array so that within each sub-array the x values are monotonic (each x appears only once). The results should be:
cord1 = np.array([[1,1],[2,3],[4,3],[5,7]])
cord2 = np.array([[2,4],[4,5],[5,5]])
cord3 = np.array([[2,5],[4,6]])
cord4 = np.array([[4,7]])
Any help is appreciated.
You will have to do this iteratively by extracting monotonic coordinates progressively from the remainder of your array:
import numpy as np
cords = np.array([[1,1],[2,3],[2,4],[2,5],[4,3],[4,5],[4,6],[4,7],[5,7],[5,5]])
result = []
while cords.size > 0:
    # keep the first row of each run of equal x values
    mask = np.insert(cords[:-1, 0] != cords[1:, 0], 0, [True])
    result.append(cords[mask, :])
    cords = cords[~mask, :]
Output:
for mono in result: print(list(map(list,mono)))
[[1, 1], [2, 3], [4, 3], [5, 7]]
[[2, 4], [4, 5], [5, 5]]
[[2, 5], [4, 6]]
[[4, 7]]
note: this assumes that the points are in order of their x coordinate. You will need to sort them beforehand if that is not the case.
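A minimal sketch of that pre-sort (a stable sort preserves the relative order of points sharing the same x):
cords = cords[np.argsort(cords[:, 0], kind='stable')]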

Reverse order of some elements in Tensorflow

Say I have a tensor DATA of shape (M, N, 2).
I also have another tensor IND of shape (N) consisting of zeros and ones.
If IND(i)==1 then DATA(:,i,0) and DATA(:,i,1) have to swap. If IND(i)==0 they won't swap.
How can I do this? I know that this can be done via tf.gather_nd, but I have no idea how.
Here is one possible solution (TF 1.x) with tf.equal, tf.where, tf.scatter_nd_update, tf.gather_nd and tf.reverse_v2:
data = tf.Variable([[[1, 2],
                     [2, 3],
                     [3, 4],
                     [4, 5],
                     [5, 6]]])  # shape=(1, 5, 2)

# reverse elements where ind is 1
ind = tf.constant([1, 0, 1, 0, 1])  # shape=(5,)
cond = tf.where(tf.equal([ind], 1))
match_data = tf.gather_nd(data, cond)
rev_match_data = tf.reverse_v2(match_data, axis=[-1])
data = tf.scatter_nd_update(data, cond, rev_match_data)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(data))
# [[[2 1]
#   [2 3]
#   [4 3]
#   [4 5]
#   [6 5]]]
One way which does not use tf.gather_nd is as follows. The idea is to build DATA1, which is DATA with all possible swaps (i.e. the result of swapping if IND had been a vector of 1s), and use masks to choose the correct values from either DATA or DATA1 depending on whether a swap is needed or not.
DATA1 = tf.concat([tf.reshape(DATA[:, :, 1], [M, N, 1]), tf.reshape(DATA[:, :, 0], [M, N, 1])], axis=2)
Mask1 = tf.cast(tf.reshape(IND, [1, N, 1]), tf.float64)  # assumes DATA holds float64 values
Mask0 = 1 - Mask1
Res = tf.multiply(Mask0, DATA) + tf.multiply(Mask1, DATA1)
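On TF 2.x, a shorter route (a sketch, not from the original answers) is to let tf.where broadcast a per-column condition against the original and the reversed tensor:
import tensorflow as tf  # TF 2.x assumed

data = tf.constant([[[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]])  # shape (1, 5, 2)
ind = tf.constant([1, 0, 1, 0, 1])                              # shape (5,)

# Broadcast the condition to shape (1, 5, 1), then pick the reversed
# pair where ind is 1 and the original pair elsewhere.
cond = tf.equal(ind, 1)[tf.newaxis, :, tf.newaxis]
result = tf.where(cond, tf.reverse(data, axis=[-1]), data)
# [[[2 1]
#   [2 3]
#   [4 3]
#   [4 5]
#   [6 5]]]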

How to compare two numpy arrays and add missing values to the other with a tweak

I have two numpy arrays of different sizes. I want to add the extra elements of the bigger array to the smaller array, where for each added pair the 0th element is kept and the 1st element is set to 0.
For example:
a = [ [2,4],[4,5], [8,9],[7,5]]
b = [ [2,5], [4,6]]
After adding the missing elements to b, b would become:
b = [ [2,5], [4,6], [8,0], [7,0] ]
I have tried the logic up to some extent; however, some values are getting redundantly added, as I am not able to check whether an element has already been added to b or not.
Secondly, I am doing it with the help of an additional array c, which is a copy of b, and then doing the desired operations on c. If somebody can show me how to do it without the third array c, that would be very helpful.
import numpy as np

a = [[2,3],[4,5],[6,8],[9,6]]
b = [[2,3],[4,5]]
a = np.array(a)
b = np.array(b)
c = np.array(b)
for i in range(len(b)):
    for j in range(len(a)):
        if a[j,0] == b[i,0]:
            print("matched")
        else:
            print("not matched")
            c = np.insert(c, len(c), [a[j,0], 0], axis=0)
print(c)
##### For explanation #####
# basic set operation to get the missing elements
c = set([i[0] for i in a]) - set([i[0] for i in b])
# c will just store the missing elements...
# then just append the elements (this assumes b is still a plain list)
for i in c:
    b.append([i, 0])
Output -
[[2, 5], [4, 6], [8, 0], [7, 0]]
Edit -
But as they are numpy arrays you can just do this (and without using c as an intermediate) - just two lines:
for i in set(a[:, 0]) - set(b[:, 0]):
    b = np.append(b, [[i, 0]], axis=0)
Output -
array([[2, 5],
       [4, 6],
       [8, 0],
       [7, 0]])
You can use np.in1d to find which first-column values of a already appear in b, take the remaining rows of a with their second element zeroed (multiplying by [1, 0]), and stack them under b. Thus, we would have a vectorized approach as shown below -
np.vstack((b,a[~np.in1d(a[:,0],b[:,0])]*[1,0]))
Sample run -
In [47]: a
Out[47]:
array([[2, 4],
       [4, 5],
       [8, 9],
       [7, 5]])

In [48]: b
Out[48]:
array([[8, 7],
       [4, 6]])

In [49]: np.vstack((b, a[~np.in1d(a[:,0], b[:,0])] * [1, 0]))
Out[49]:
array([[8, 7],
       [4, 6],
       [2, 0],
       [7, 0]])
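On newer NumPy versions, np.isin is the recommended spelling of np.in1d; the call is otherwise identical:
np.vstack((b, a[~np.isin(a[:, 0], b[:, 0])] * [1, 0]))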
First we should clear up one misconception: c does not have to be a copy; a new variable assignment is sufficient.
c = b
...
c = np.insert(c, len(c), [a[j,0], 0], axis=0)
np.insert does not modify any of its inputs. Rather, it makes a new array, and the c = ... just binds that new array to c, replacing the previous one. So the original c assignment merely makes writing the iteration easier.
Since you are adding this new [a[j,0], 0] at the end, you could use concatenate (the underlying function used by insert and the stack functions); note the extra brackets, needed so both arguments are 2D:
c = np.concatenate((c, [[a[j,0], 0]]), axis=0)
That won't make much of a change in the run time. It's better to find all the a[j] and add them all at once.
In this case you want to add a[2,0] and a[3,0]. Leaving aside, for the moment, the question of how we find [2,3], we can do:
In [595]: a=np.array([[2,3],[4,5],[6,8],[9,6]])
In [596]: b=np.array([[2,3],[4,5]])
In [597]: ind = [2,3]
An assign-and-fill approach would look like:
In [605]: c = np.zeros_like(a)            # target array
In [607]: c[0:b.shape[0], :] = b          # fill in the b values
In [608]: c[b.shape[0]:, 0] = a[ind, 0]   # fill in the selected a column
In [609]: c
Out[609]:
array([[2, 3],
       [4, 5],
       [6, 0],
       [9, 0]])
A variation would be to construct a temporary array with the new a values, and concatenate:
In [613]: a1 = np.zeros((len(ind), 2), a.dtype)
In [614]: a1[:, 0] = a[ind, 0]
In [616]: np.concatenate((b, a1), axis=0)
Out[616]:
array([[2, 3],
       [4, 5],
       [6, 0],
       [9, 0]])
I'm using the a1 create and fill approach because I'm too lazy to figure out how to concatenate a[ind,0] with enough 0s to make the same thing. :)
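For what it's worth, one way to do that (a sketch, not part of the original answer) is to stack a[ind, 0] against a matching column of zeros:
np.concatenate((b, np.stack((a[ind, 0], np.zeros(len(ind), a.dtype)), axis=1)), axis=0)
# array([[2, 3],
#        [4, 5],
#        [6, 0],
#        [9, 0]])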
As Divakar shows, np.in1d is a handy way of finding the matches
In [617]: np.in1d(a[:,0],b[:,0])
Out[617]: array([ True, True, False, False], dtype=bool)
In [618]: np.nonzero(~np.in1d(a[:,0],b[:,0]))
Out[618]: (array([2, 3], dtype=int32),)
In [619]: np.nonzero(~np.in1d(a[:,0],b[:,0]))[0]
Out[619]: array([2, 3], dtype=int32)
In [620]: ind=np.nonzero(~np.in1d(a[:,0],b[:,0]))[0]
If you don't care about the order a[ind,0] can also be gotten with np.setdiff1d(a[:,0],b[:,0]) (the values will be sorted).
Assuming you are working on a single dimensional array:
import numpy as np

a = np.linspace(1, 90, 90)
b = np.array([1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,
              21,22,23,24,25,27,28,31,32,33,34,35,36,37,38,39,
              40,41,42,43,44,46,47,48,49,50,51,52,53,54,55,56,
              57,58,59,60,61,62,63,64,65,67,70,72,73,74,75,76,
              77,78,79,80,81,82,84,85,86,87,88,89,90])
m_num = np.setxor1d(a, b).astype(np.uint8)
print("Total {0} numbers missing: {1}".format(len(m_num), m_num))
This also works in a 2D space (note that np.setxor1d flattens its inputs):
t1 = np.reshape(a, (10, 9))
t2 = np.reshape(b, (10, 8))
m_num2 = np.setxor1d(t1, t2).astype(np.uint8)
print("Total {0} numbers missing: {1}".format(len(m_num2), m_num2))
