Get unique intersection values of two sets - python

I'd like to get the indices of the unique row vectors shared by two matrices, using hashes (which is efficient for matrices). np.intersect1d returns values rather than indices. np.in1d, on the other hand, returns a boolean mask that can be turned into indices, but not unique ones. I worked around this by zipping the hashes into a dict, but that doesn't seem like the most efficient approach. I'm new to Python, so I'd like to know whether there is a better way. Thanks for the help!
code:
import numpy as np
import hashlib
x=np.array([[1, 2, 3],[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y=np.array([[4, 5, 6], [7, 8, 9],[1, 2, 3]])
xhash=[hashlib.sha1(row).digest() for row in x]
yhash=[hashlib.sha1(row).digest() for row in y]
z=np.intersect1d(xhash,yhash)
idx=list(range(len(xhash)))
d=dict(zip(xhash,idx))
unique_idx=[d[i] for i in z] #is there a better way to get this or boolean array
print(unique_idx)
uniques=np.array([x[i] for i in unique_idx])
print(uniques)
output:
>>> [2, 3, 1]
[[4 5 6]
[7 8 9]
[1 2 3]]
I'm having a similar issue for np.unique() where it doesn't give me any indexes.

Use np.unique's return_index parameter to get the index of the first occurrence of each unique value among the matches found by np.in1d:
code:
import numpy as np
import hashlib
x=np.array([[1, 2, 3],[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y=np.array([[1, 2, 3], [7, 8, 9]])
xhash=[hashlib.sha1(row).digest() for row in x]
yhash=[hashlib.sha1(row).digest() for row in y]
z=np.in1d(xhash,yhash)
##Use unique to get unique indices into the in1d results
_,unique=np.unique(np.array(xhash)[z],return_index=True)
##Compute indices by indexing an array of indices
idx=np.arange(len(xhash))
unique_idx=idx[z][unique]
print('x=',x)
print('unique_idx=',unique_idx)
print('x[unique_idx]=',x[unique_idx])
Output:
x= [[1 2 3]
[1 2 3]
[4 5 6]
[7 8 9]]
unique_idx= [3 0]
x[unique_idx]= [[7 8 9]
[1 2 3]]
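If your NumPy version has the return_indices argument of np.intersect1d (I believe it was added in 1.15), you can also ask intersect1d for the indices directly. A minimal sketch, assuming the rows are hashed with tobytes() and hexdigest() (my choice here, not part of the answer above) and that x and y share the same dtype:

```python
import hashlib

import numpy as np

x = np.array([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([[1, 2, 3], [7, 8, 9]])

# Hash each row; hexdigest() gives plain text that NumPy stores safely.
xhash = [hashlib.sha1(row.tobytes()).hexdigest() for row in x]
yhash = [hashlib.sha1(row.tobytes()).hexdigest() for row in y]

# return_indices=True also returns, for each common value, the index of
# its first occurrence in each input.
_, x_idx, y_idx = np.intersect1d(xhash, yhash, return_indices=True)

# The indices come back ordered by hash value, so sort them if you want
# the rows in their original order.
x_idx = np.sort(x_idx)
print(x_idx)     # [0 3]
print(x[x_idx])
```

x_idx holds the first occurrence in x of each shared row, so the duplicate row at index 1 is dropped.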

The numpy_indexed package (disclaimer: I am its author) has efficient functionality for doing things like this (and related functionality):
import numpy_indexed as npi
uniques = npi.intersection(x, y)
Note that this solution does not use hashing, but bitwise equality of the elements of the sequence; so no risk of hash collisions, and likely a lot faster in practice.
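For comparison, a hash-free version can be sketched in plain NumPy as well. This is only an illustration and not how numpy_indexed works internally. It assumes NumPy 1.13 or later for the axis argument of np.unique, and the pairwise comparison needs memory proportional to len(x) * len(y), so it only suits small inputs:

```python
import numpy as np

x = np.array([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([[4, 5, 6], [7, 8, 9], [1, 2, 3]])

# Unique rows of x, plus the index of the first occurrence of each one.
ux, first_idx = np.unique(x, axis=0, return_index=True)

# Compare every unique row of x with every row of y via broadcasting,
# then keep the unique rows that equal at least one row of y.
in_y = (ux[:, None, :] == y[None, :, :]).all(axis=2).any(axis=1)

unique_idx = np.sort(first_idx[in_y])
print(unique_idx)     # [0 2 3]
print(x[unique_idx])
```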

Related

How to get N maximum values of each array from a numpy array of arrays

I have a numpy array of arrays x = [[1, 3, 4, 5], [6, 2, 5, 7]]. I want to get the N maximum values from each inner array: [[5, 4], [7, 6]]. I have tried using np.argpartition(x, -N, axis=0)[-N:] but it gives ValueError: kth(=-3) out of bounds (1). What is an efficient way to do this?
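You can do this by sorting each row and slicing as needed. Use a negative slice for the maxima so that it works for any row length:
np.sort(x, axis=1)[:, :2] # --> [[1 3] [2 5]] 2 minimum in each row
np.sort(x, axis=1)[:, -2:] # --> [[4 5] [6 7]] 2 maximum in each row
np.sort(x, axis=1)[:, -2:][:, ::-1] # --> [[5 4] [7 6]] 2 maximum in each row, largest first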
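The ValueError in the question comes from partitioning along axis=0, where each column has only two elements, so kth=-3 is out of range. A sketch of the same argpartition idea along axis=1, which avoids fully sorting every row (N and the variable names are illustrative, and np.take_along_axis assumes NumPy 1.15 or later):

```python
import numpy as np

x = np.array([[1, 3, 4, 5], [6, 2, 5, 7]])
N = 2

# Column indices of the N largest values in each row (in no particular order).
idx = np.argpartition(x, -N, axis=1)[:, -N:]

# Pull those values out row by row.
top = np.take_along_axis(x, idx, axis=1)

# Sort only the N selected values per row, largest first.
top_sorted = np.sort(top, axis=1)[:, ::-1]
print(top_sorted)
```

For short rows like these a full sort is just as good; partitioning only starts to pay off when each row is long and N is small.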

Use numpy to stack combinations of a 1D and 2D array

I have 2 numpy arrays, one 2D and the other 1D, for example like this:
import numpy as np
a = np.array(
[
[1, 2],
[3, 4],
[5, 6]
]
)
b = np.array(
[7, 8, 9, 10]
)
I want to get all possible combinations of the elements in a and b, treating a like a 1D array, so that it leaves the rows in a intact, but also joins the rows in a with the items in b. It would look something like this:
>>> combine1d(a, b)
[ [1 2 7] [1 2 8] [1 2 9] [1 2 10]
[3 4 7] [3 4 8] [3 4 9] [3 4 10]
[5 6 7] [5 6 8] [5 6 9] [5 6 10] ]
I know that there are slow solutions for this (like a for loop), but I need a fast solution to this as I am working with datasets with millions of integers.
Any ideas?
This is one of those cases where it's easier to build a higher dimensional object, and then fix the axes when you're done. The first two dimensions are the length of b and the length of a. The third dimension is the number of elements in each row of a plus 1. We can then use broadcasting to fill in this array.
x, y = a.shape
z, = b.shape
result = np.empty((z, x, y + 1))
result[...,:y] = a
result[...,y] = b[:,None]
At this point, to get the exact answer you asked for, you'll need to swap the first two axes, and then merge those two axes into a single axis.
result.swapaxes(0, 1).reshape(-1, y + 1)
An hour later. . . .
I realized by being a little bit more clever, I didn't need to swap axes. This also has the nice benefit that the result is a contiguous array.
def convert1d(a, b):
    x, y = a.shape
    z, = b.shape
    result = np.empty((x, z, y + 1))
    result[...,:y] = a[:,None,:]
    result[...,y] = b
    return result.reshape(-1, y + 1)
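Note that np.empty defaults to float64, so the code above returns floats even for integer inputs. A minimal alternative sketch that keeps the integer dtype, using np.repeat and np.tile (combine1d is the name from the question; this is not the answerer's code):

```python
import numpy as np


def combine1d(a, b):
    # Repeat each row of a once per element of b: (len(a) * len(b), a.shape[1]).
    rows = np.repeat(a, len(b), axis=0)
    # Cycle through b once per row of a and make it a column.
    col = np.tile(b, len(a))[:, None]
    return np.hstack([rows, col])


a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([7, 8, 9, 10])
out = combine1d(a, b)
print(out.shape)  # (12, 3)
print(out[:4])
```

The rows come out in the order shown in the question: all of b paired with the first row of a, then with the second, and so on.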
This is a very "scotch tape" solution:
import numpy as np
a = np.array(
[
[1, 2],
[3, 4],
[5, 6]
]
)
b = np.array(
[7, 8, 9, 10]
)
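z = []
# Loop over the rows of a first so that each row of a stays grouped
# with all elements of b, as in the desired output.
for y in a:
    for x in b:
        z.append(np.append(y, x))
np.array(z).reshape(3, 4, 3)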
You can use np.c_ to join the two arrays column-wise. I also used np.full to build a constant column from each element of the second array (b). The result looks like this:
result = [np.c_[a, np.full((a.shape[0],1), x)] for x in b]
result
Output
[array([[1, 2, 7],
[3, 4, 7],
[5, 6, 7]]),
array([[1, 2, 8],
[3, 4, 8],
[5, 6, 8]]),
array([[1, 2, 9],
[3, 4, 9],
[5, 6, 9]]),
array([[ 1, 2, 10],
[ 3, 4, 10],
[ 5, 6, 10]])]
The output might look messy: it is a list of arrays, one per element of b, so it holds the same rows as your desired output but grouped by b. To check, you can run the following to see the first element of the result list:
print(result[0])
Output
[[1 2 7]
[3 4 7]
[5 6 7]]

Elements overlapping rows and columns

Question:
Create an array x of shape (n_row, n_col) holding the first N natural numbers.
N = 30, n_row = 6, n_col = 5
Print the elements where the first two rows overlap the last three columns.
Expected output:
[[2 3 4]
[7 8 9]]
My output:
[2 3 7 8]
My approach:
x = np.arange(N)
x = x.reshape(n_row, n_col)
a = np.intersect1d(x[0:2,], x[:, -3:-1])
print(a)
I couldn't think of anything else. Please help.
The overlap of a row slice and a column slice of the same array is just the combined slice:
import numpy as np
x = np.arange(30).reshape(6, 5)
x[:2,-3:]
Output
array([[2, 3, 4],
[7, 8, 9]])
Computing the overlap by finding matching elements is odd, but possible:
r, c = np.where(np.isin(x, np.intersect1d(x[:2], x[:,-3:])))
x[np.ix_(np.unique(r), np.unique(c))]
Output
array([[2, 3, 4],
[7, 8, 9]])
I think the other answers are a bit convoluted.
Reading the original question:
Question: Create an array x of shape (n_row, n_col) holding the first N natural numbers. N = 30, n_row = 6, n_col = 5
Print the elements where the first two rows overlap the last three columns.
I understand this as plain sub-indexing:
N, n_rows, n_cols = 30, 6, 5
a = np.arange(N).reshape(n_rows, n_cols)
print(a[:2, -3:])
Output:
[[2 3 4]
[7 8 9]]
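To see why the attempt in the question printed [2 3 7 8]: the slice -3:-1 stops before the last column, and np.intersect1d returns a flattened, sorted 1D array of common values rather than a 2D block. A small sketch contrasting the approaches (variable names are illustrative):

```python
import numpy as np

x = np.arange(30).reshape(6, 5)

# The attempt from the question: -3:-1 selects columns 2 and 3 only.
attempt = np.intersect1d(x[0:2], x[:, -3:-1])
print(attempt)       # [2 3 7 8]

# Including the last column fixes the values, but the result is still 1D.
flat_overlap = np.intersect1d(x[0:2], x[:, -3:])
print(flat_overlap)  # [2 3 4 7 8 9]

# Combining both slices in one indexing step keeps the 2D shape.
block = x[:2, -3:]
print(block)
```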

Appending contents of 1D numpy array to another 2D numpy array

I have three numpy arrays. The shape of the first is (413, 2), the shape of the second is (176, 2), and the shape of the third is (589,). If you'll notice, 413 + 176 = 589. What I want to accomplish is to use the 589 values of the third np array and make the first two arrays of shapes (413, 3) and (176, 3) respectively.
So, what I want is to take the values in the third np array and append them to the columns of the first and second np arrays. I can do the logic for applying to the first and then using the offset of the length of the first to continue appending to the second with the correct values. I suppose I could also combine np arrays 1 and 2, they are separated for a reason though because of my data preprocessing.
To put it visually if that helps, what I have is like this:
Array 1:
[[1 2]
[3 4]
[4 5]]
Array 2:
[[6 7]
[8 9]
[10 11]]
Array 3:
[1 2 3 4 5 6]
And what I want to have is:
Array 1:
[[1 2 1]
[3 4 2]
[4 5 3]]
Array 2:
[[6 7 4]
[8 9 5]
[10 11 6]]
I've tried using np.append, np.concatenate, and np.vstack but have not been able to achieve what I am looking for. I am relatively new to using numpy, and Python in general, so I imagine I am just using these tools incorrectly.
Many thanks for any help that can be offered! This is my first time asking a question here so if I did anything wrong or left anything out please let me know.
Split the third array using the length of array1, then horizontally stack them. You need to use either np.newaxis or array.reshape to change the dimensionality of the slice of array3.
import numpy as np
array1 = np.array(
[[1, 2],
[3, 4],
[4, 5]]
)
array2 = np.array(
[[6, 7],
[8, 9],
[10, 11]]
)
array3 = np.array([1, 2, 3, 4, 5, 6])
array13 = np.hstack([array1, array3[:len(array1), np.newaxis]])
array23 = np.hstack([array2, array3[len(array1):, np.newaxis]])
Outputs:
array13
array([[1, 2, 1],
[3, 4, 2],
[4, 5, 3]])
array23
array([[ 6, 7, 4],
[ 8, 9, 5],
[10, 11, 6]])
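The same split-then-stack idea can also be sketched with np.split and np.column_stack, which turns a 1D array into a column without np.newaxis. The names part1 and part2 are illustrative:

```python
import numpy as np

array1 = np.array([[1, 2], [3, 4], [4, 5]])
array2 = np.array([[6, 7], [8, 9], [10, 11]])
array3 = np.array([1, 2, 3, 4, 5, 6])

# Cut array3 at len(array1): the first part goes with array1, the rest with array2.
part1, part2 = np.split(array3, [len(array1)])

# column_stack treats each 1D input as a column.
array13 = np.column_stack([array1, part1])
array23 = np.column_stack([array2, part2])
print(array13)
print(array23)
```

With the real data this yields shapes (413, 3) and (176, 3), provided the third array has exactly 413 + 176 entries.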

How to get max (top) N values across entire numpy matrix

I want to get the top N (maximal) args & values across an entire numpy matrix, as opposed to across a single dimension (rows / columns).
Example input (with N=3):
import numpy as np
mat = np.matrix([[9,8, 1, 2], [3, 7, 2, 5], [0, 3, 6, 2], [0, 2, 1, 5]])
print(mat)
[[9 8 1 2]
[3 7 2 5]
[0 3 6 2]
[0 2 1 5]]
Desired output: [9, 8, 7]
Taking the max along a single dimension doesn't work, because several of the top values can share a row or a column.
# by rows, no 8
np.squeeze(np.asarray(mat.max(1).reshape(-1)))[:3]
array([9, 7, 6])
# by cols, no 7
np.squeeze(np.asarray(mat.max(0)))[:3]
array([9, 8, 6])
I have code that works, but looks really clunky to me.
# reshape into single vector
mat_as_vector = np.squeeze(np.asarray(mat.reshape(-1)))
# get top 3 arg positions
top3_args = mat_as_vector.argsort()[::-1][:3]
# subset the reshaped matrix
top3_vals = mat_as_vector[top3_args]
print(top3_vals)
[9 8 7]
Would appreciate any shorter way / more efficient way / magic numpy function to do this!
Using numpy.partition() is significantly faster than performing full sort for this purpose:
np.partition(np.asarray(mat), mat.size - N, axis=None)[-N:]
assuming N<=mat.size.
If you also need the top N values in sorted order, sort the partitioned result (this sorts only N elements rather than the whole array):
np.sort(np.partition(np.asarray(mat), mat.size - N, axis=None)[-N:])
If you need the result sorted from largest to smallest, append [::-1] to the previous command:
np.sort(np.partition(np.asarray(mat), mat.size - N, axis=None)[-N:])[::-1]
Another way is to flatten the matrix, sort it in descending order, and slice off the top N values:
sorted(mat.flatten().tolist()[0], reverse=True)[:3]
Result:
[9, 8, 7]
The idea is from this answer: How to get indices of N maximum values in a numpy array?
import numpy as np
import heapq
mat = np.matrix([[9,8, 1, 2], [3, 7, 2, 5], [0, 3, 6, 2], [0, 2, 1, 5]])
ind = heapq.nlargest(3, range(mat.size), mat.take)
print(mat.take(ind).tolist()[0])
Output
[9, 8, 7]
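Since the question also asks for the positions of the top values, here is a sketch that returns both the values and their (row, column) indices. It assumes a plain ndarray rather than np.matrix (my substitution; np.asarray(mat) would convert one), and the variable names are illustrative:

```python
import numpy as np

mat = np.array([[9, 8, 1, 2], [3, 7, 2, 5], [0, 3, 6, 2], [0, 2, 1, 5]])
N = 3

flat = mat.ravel()

# Flat indices of the N largest values, in no particular order.
top = np.argpartition(flat, -N)[-N:]

# Order just those N indices by value, largest first.
top = top[np.argsort(flat[top])[::-1]]

# Convert flat indices back to (row, column) positions.
rows, cols = np.unravel_index(top, mat.shape)
values = flat[top]

print(values)                 # [9 8 7]
print(list(zip(rows, cols)))
```

If several entries tie for the N-th largest value, which of them is returned is not specified.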
