The following function is designed to find the unique rows of an array:
import numpy as np

def unique_rows(a):
    b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    _, idx = np.unique(b, return_index=True)
    unique_a = a[idx]
    return unique_a
For example,
test = np.array([[1,0,1],[1,1,1],[1,0,1]])
unique_rows(test)
array([[1, 0, 1],
       [1, 1, 1]])
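(For reference, NumPy 1.13+ can do this directly with the axis keyword, and it should agree with the function above:
np.unique(test, axis=0)
# array([[1, 0, 1],
#        [1, 1, 1]])
)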
I believe this function should work all the time, but it may not be watertight. In my code I would like to calculate how many unique positions exist for a set of particles. The particles are stored in a 2d array, each row corresponding to the position of a particle. The positions are of type np.float64.
I have also defined the following function:
def pos_tag(pos):
    x, y, z = pos[:,0], pos[:,1], pos[:,2]
    return (2**x)*(3**y)*(5**z)
In principle this function should produce a unique value for any (x, y, z) position, at least for integer or rational coordinates, where unique prime factorisation guarantees it.
However, when I use these two functions to calculate the number of unique positions in my set of particles, they produce different answers. Is this due to some possible logical flaw in the first function, or to the second function not producing a unique value for each given position?
EDIT: Usage example
I have some long code that produces a 2d array of particle positions.
partpos.shape = (6039539,3)
I then calculate the number of unique rows as follows
len(unique_rows(partpos))
6034411
And
posids = pos_tag(partpos)
len(np.unique(posids))
5328871
I believe that the discrepancy arises due to a precision error.
Using the code
print(len(unique_rows(partpos.astype(np.float32))))
print(len(np.unique(pos_tag(partpos))))
6034411
6034411
However with
print(len(unique_rows(partpos.astype(np.float32))))
print(len(np.unique(pos_tag(partpos.astype(np.float32)))))
6034411
5328871
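A minimal sketch of the kind of collision I suspect: once 2**x overflows float64, distinct positions map to the same tag (the coordinate values here are made up purely for illustration):
big = np.array([[2000.0, 1.0, 1.0],
                [2000.0, 2.0, 1.0]])   # two distinct positions
pos_tag(big)                  # 2.0**2000 overflows to inf, so both tags are inf
len(np.unique(pos_tag(big)))  # 1, even though the rows differ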
a = [[1,0,1],[1,1,1],[1,0,1]]
# Convert rows to tuples so they're hashable, creating a generator thereof
b = (tuple(row) for row in a)
# Convert back to list of lists, after coercing to a set to eliminate non-unique rows
unique_rows = list(list(row) for row in set(b))
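Note that going through a set does not preserve the original row order, so the unique rows may come back in arbitrary order.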
Edit: Well that's embarrassing. I just realized I didn't really address the question asked. This could still be the answer the OP is looking for, so I'll leave it, but it's not really what was asked. Sorry for that.
Related
I have two arrays, called A and B, which contain instances from DATA. These two arrays then refer to another array called Distance.
I need a fast way to:
find the point combinations between A and B,
look up the distance for each combination in Distance.
For example:
DATA = [0,1,...100]
A = [0,1,2]
B = [6,7,8]
Distance = [100x100] # contains the pairwise distance of all instances from DATA
# need a function to combine A and B
points_combination=[[0,6],[0,7],[0,8],[1,6],[1,7],[1,8],[2,6],[2,7],[2,8]]
# need a function to refer points_combination with Distance, so that I can get this results
distance_points=[0.346, 0.270, 0.314, 0.339, 0.241, 0.283, 0.304, 0.294, 0.254]
I already tried to solve it myself, but it is very slow on large data.
Here's the code I tried:
import numpy as np

def function(pair_distances, k, clusters):
    list_distance = []
    cluster_qty = k
    for cluster_id in range(cluster_qty):
        all_clusters = clusters[:]  # list of all instance IDs in their own clusters
        in_cluster = all_clusters.pop(cluster_id)  # list of instance IDs inside the cluster
        not_in_cluster = all_clusters  # list of instance IDs outside the cluster
        # combine A and B arrays into pairs of indices into the Distance array
        list_dist_id = np.array(np.meshgrid(in_cluster, np.concatenate(not_in_cluster))).T.reshape(-1, 2)
        temp_dist = 9999999
        for instance in range(len(list_dist_id)):
            # look up the distance value for this pair in the pair_distances array
            temp_dist = min(temp_dist, pair_distances[list_dist_id[instance][0], list_dist_id[instance][1]])
        list_distance.append(temp_dist)
    return list_distance
Notice that the nested loop is what makes this so time-consuming.
This is my first time asking in this forum, so please let me know if you need more information.
The first part (points_combination) is already covered extensively in this post:
Cartesian product of x and y array points into single array of 2D points
The second part (distance_points): the algorithm linking points_combination to distance_points is not provided. It would help if you could add a small sample data set showing how to get from your data to distance_points.
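For what it's worth, here is a minimal sketch of how both steps can be vectorized once such a sample is available (Distance is random here, purely for illustration):
import numpy as np

rng = np.random.default_rng(0)
Distance = rng.random((100, 100))   # stand-in for the real pairwise-distance matrix
A = np.array([0, 1, 2])
B = np.array([6, 7, 8])

# all (a, b) pairs, equivalent to the linked cartesian-product recipes
aa, bb = np.meshgrid(A, B, indexing='ij')
points_combination = np.stack([aa.ravel(), bb.ravel()], axis=1)

# gather every pairwise distance in one fancy-indexing step, no Python loop
distance_points = Distance[points_combination[:, 0], points_combination[:, 1]]
# the question's inner min-loop then collapses to distance_points.min()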
Is there any efficient way to merge one tensor into another in PyTorch, but only at specific indices?
Here is my full problem.
I have a list of indices of a tensor; in the code below, xy is the original tensor.
I need to preserve the rows of xy whose indices are in the indexes list, and apply some function to the remaining rows (for simplicity, let's say the function is 'multiply them by two'),
xy = torch.rand(100,4)
indexes=[1,2,55,44,66,99,3,65,47,88,99,0]
Then merge them back into the original tensor.
This is what I have done so far:
I create a mask tensor
import numpy as np
import torch

indexes = [1, 2, 55, 44, 66, 99, 3, 65, 47, 88, 99, 0]
xy = torch.rand(100, 4)
mask = []
for i in range(0, xy.shape[0]):
    if i in indexes:
        mask.append(False)
    else:
        mask.append(True)
print(mask)
target_mask = torch.from_numpy(np.array(mask, dtype=bool))
print(target_mask.sum())  # output is 89, the number of rows other than the preserved ones
Then I apply the function to the masked rows:
zy = xy[target_mask]
print(zy)
zy=zy*2
print(zy)
The code above works fine and is posted here to clarify the problem.
Now I want to merge the tensor zy back into xy, guided by the indices saved in the list indexes.
Here is the pseudocode I made; as one can see it is too complex and needs 3 for loops to complete the task, which would waste too many resources.
# pseudocode
for masked_row in indexes:
    for xy_rows_index in xy:
        if xy_rows_index == masked_row:
            pass
        else:
            take zy tensor row and replace here  # another loop to read zy
But I am not sure what an efficient way to merge them is. I don't want to use NumPy or for loops, etc.; that would make the process slow, as the original tensor is very big and I am going to use a GPU.
Is there an efficient way to do this in PyTorch?
Once you have your mask you can assign updated values in place.
zy = 2 * xy[target_mask]
xy[target_mask] = zy
As for acquiring the mask, I don't necessarily see a problem with your approach, though using the built-in set operations would probably be more efficient. This also gives an index list instead of a mask, which, depending on the number of indices being updated, may be more efficient.
i = list(set(range(len(xy)))-set(indexes))
zy = 2 * xy[i]
xy[i] = zy
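If building the mask itself ever becomes a bottleneck, here is a loop-free sketch using only tensor operations (it assumes every value in indexes is a valid row index; the names mirror the question's):
import torch

xy = torch.rand(100, 4)
indexes = torch.tensor([1, 2, 55, 44, 66, 99, 3, 65, 47, 88, 99, 0])

target_mask = torch.ones(xy.shape[0], dtype=torch.bool)
target_mask[indexes] = False            # preserved rows are masked out
xy[target_mask] = 2 * xy[target_mask]   # update every other row in place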
Edit:
To address the comment: to find the complement of the indices in i, we can do
i_complement = list(set(range(len(xy)))-set(i))
However, assuming indexes contains only values between 0 and len(xy)-1, we could equivalently use i_complement = list(set(indexes)), which just removes the repeated values in indexes.
I have an image stored as 3 Numpy arrays:
# Int arrays of coordinates
# Not continuous, some points are omitted
X_image = np.array([1,2,3,4,5,6,7,9])
Y_image = np.array([9,8,7,6,5,4,3,1])
# Float array of RGB values.
# Same index
rgb = np.array([
[0.5543,0.2665,0.5589],
[0.5544,0.1665,0.5589],
[0.2241,0.6645,0.5249],
[0.2242,0.6445,0.2239],
[0.2877,0.6425,0.5829],
[0.5543,0.3165,0.2839],
[0.3224,0.4635,0.5879],
[0.5534,0.6693,0.5889],
])
The RGB information is not convertible to int, so it has to stay float.
I have two other arrays that define the positions of an area of some pixels in the image:
X_area = np.array([3,4,6])
Y_area = np.array([7,6,4])
I need to find the RGB information for these pixels, using the first three arrays as a reference.
My idea was to search for the indices of these area points in the full image and then use those indices to retrieve the RGB information.
index = search_for_index_of_array_1_in_array_2((X_area,Y_area),(X_image,Y_image))
# index shall be [2, 3, 5]
rgb_area = rgb[index]
The search_for_index_of_array_1_in_array_2 can be implemented with a for loop. I tried it; it is too slow. I actually have millions of points.
I know this is probably more of a use case for Julia than Python, as we are dealing with low-level data manipulation with performance needs, but I'm obliged to use Python. So the only performance trick I see is to use a vectorized NumPy solution.
I'm not used to manipulating NumPy arrays. I tried numpy.where:
index = np.where(X_area in X_image and Y_area in Y_image )
index
Gives:
<ipython-input-18-0e434ab7a291>:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
index = np.where(X_area in X_image and Y_area in Y_image )
(array([], dtype=int64),)
It should not be empty, as we have 3 matching points.
I also tested, with the same result:
XY_image = np.vstack((X_image,Y_image))
XY_area = np.vstack((X_area,Y_area))
index = np.where(XY_area == XY_image)
and even:
np.extract(XY_image == XY_area, XY_image)
If I understand correctly, the issue is that the arrays do not have the same length. But that is what I have to work with.
Do you have an idea of how to proceed?
Thanks
Edit: here is a loop that works but... is not fast:
indexes = []
for i in range(XY_area.shape[1]):
    XY_area_b = np.broadcast_to(XY_area[:, i], (XY_image.shape[1], 2)).transpose()
    where_in_image = np.where(XY_area_b == XY_image)
    index_in_image = where_in_image[1][1]
    indexes.append(index_in_image)
indexes
The classical way to solve this problem is generally to use a hashmap. However, NumPy does not provide such a data structure. That being said, an alternative (generally slower) solution is to sort the values and then perform a binary search. Fortunately, NumPy provides useful functions to do that. This solution, running in O(n log m) (with n the number of values to search for and m the number of values searched), should be much faster than a linear search running in O(n m) time. Here is an example:
# Format the inputs
valType = X_image.dtype
assert Y_image.dtype == valType and X_area.dtype == valType and Y_area.dtype == valType
pointType = [('x', valType), ('y', valType)]
XY_image = np.ravel(np.column_stack((X_image, Y_image))).view(pointType)
XY_area = np.ravel(np.column_stack((X_area, Y_area))).view(pointType)

# Build an index to sort XY_image and then generate the sorted points
sortingIndex = np.argsort(XY_image)
sorted_XY_image = XY_image[sortingIndex]

# Search each value of XY_area in the sorted points,
# then map each position back to the unsorted array
tmp = np.searchsorted(sorted_XY_image, XY_area)
index = sortingIndex[tmp]
rgb_area = rgb[index]
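For completeness, a plain-Python dict can serve as the hashmap NumPy lacks. This sketch (assuming the integer coordinate arrays above) runs in O(n + m) overall, though it pays Python-level overhead per point:
# Map each (x, y) image point to its index, then look the area points up.
lookup = {(x, y): i for i, (x, y) in enumerate(zip(X_image, Y_image))}
index = np.array([lookup[(x, y)] for x, y in zip(X_area, Y_area)])
rgb_area = rgb[index]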
Thanks to Jérôme's answer, I understand better the value of using a hashmap:
def hashmap(X, Y):
    return X + 10000*Y
h_area = hashmap(X_area,Y_area)
h_image = hashmap(X_image,Y_image)
np.where(np.isin(h_image,h_area))
This hashmap is a bit brutal (it is only collision-free as long as 0 <= X < 10000), but it actually returns the indexes:
(array([2, 3, 5], dtype=int64),)
What would be a pythonic and effective way to find/remove palindrome rows from a matrix? Though the title suggests the matrix is a numpy ndarray, it can be a pandas DataFrame if that leads to a more elegant solution.
The obvious way would be to implement this using a for-loop, but I'm interested in whether there is a more effective and succinct way.
My first idea was to concatenate the rows and the reversed rows, and then extract duplicates from the concatenated matrix. But this list of duplicates would contain both the initial row and its reverse, so to remove the second instance of each palindrome I'd still have to do some for-looping.
My second idea was to somehow use broadcasting to get the cartesian product of the rows and apply my own ufunc (perhaps created using numba) to get a 2D bool matrix. But I don't know how to create a ufunc that operates on a matrix axis instead of a scalar.
EDIT:
I guess I should apologize for the poorly formulated question (English is not my native language). I don't need to find out whether any row is itself a palindrome, but whether there are pairs of rows within the matrix that are palindromes of each other.
I simply check if each row is equal to its reflection (reversal along axis 1) in all elements; if true, it is a palindrome (correct me if I am wrong). Then I index out the rows that aren't palindromes.
import numpy as np
a = np.array([
[1,0,0,1], # Palindrome
[0,2,2,0], # Palindrome
[1,2,3,4],
[0,1,4,0],
])
wherepalindrome = (a == a[:,::-1]).all(1)
print(a[~wherepalindrome])
#[[1 2 3 4]
# [0 1 4 0]]
Naphat's answer is the pythonic (numpythonic) way to go. That should be the accepted answer.
But if your array is really large, you don't want to create a temporary copy, and you wish to explore Numba's intricacies, you can use something like this:
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def palindromic_rows(a):
    rows, cols = a.shape
    palindromes = np.full(rows, True, dtype=nb.boolean)
    mid = cols // 2
    for r in nb.prange(rows):  # <-- parallel loop
        for c in range(mid):
            if a[r, c] != a[r, -c-1]:
                palindromes[r] = False
                break
    return palindromes
This contraption just replaces the elegant (a == a[:,::-1]).all(axis=1), but it's almost an order of magnitude faster for very large arrays and it doesn't duplicate them.
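A quick usage sketch with the example array a from the answer above:
wherepalindrome = palindromic_rows(a)
print(a[~wherepalindrome])  # same rows as the NumPy one-liner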
When I run the command negative_only[negative_only>0]=0 (which should set the positive values in the array "negative_only" to zero), the values in a similar array ("positive_only") are also changed. Why is this happening? I'm using Python 3.7 (Windows 10 / Spyder IDE).
The code where the two arrays are manipulated is below. "long_dollars_ch" is a ~2700 x 60 array with some positive values, some negative values and a lot of zeros. This code is part of a loop that cycles through each row of the array "long_dollars_ch".
# calculations to isolate top contributors to NAV change for audits
top_check = 3  # number of top value changes to track
# calculate dollar change (for longs), and create arrays with the most positive/negative values
long_dollars_ch[c_day,:] = long_shares[c_day,:]*hist_prices_zeros[c_day,:]-long_shares[c_day,:]*hist_prices_zeros[c_day-1,:]
positive_only = long_dollars_ch[c_day,:]
positive_only[positive_only<0]=0  # makes negative values zero
idx = np.argsort(positive_only)  # create an index representing the sorted values of positive_only for c_day
non_top_vals = idx[:-top_check]
negative_only = long_dollars_ch[c_day,:]
negative_only[negative_only>0]=0  # makes positive values zero
idx = np.argsort(negative_only)  # create an index representing the sorted values of negative_only for c_day
non_bottom_vals = idx[:-top_check]
# create arrays that hold the most positive/negative dollar changes for "top_check" securities
long_dollars_ch_pos[c_day,:] = positive_only
long_dollars_ch_pos[c_day,:][non_top_vals] *= 0
long_dollars_ch_neg[c_day,:] = negative_only
long_dollars_ch_neg[c_day,:][non_bottom_vals] *= 0
The objective of this code is to create two arrays: one that has only the top "top_check" positive values (if any) and another that has the bottom "top_check" negative values (if any) for each row of the original array "long_dollars_ch". However, it appears that Python is treating "positive_only" and "negative_only" as the same variable, so an operation on one of them affects the values inside the other (which was not part of the operation).
It's quite simple: in NumPy, y = x does not copy the array :) It makes y a reference to the array x.
In other words, you do not have two arrays after using "="; you still have one array x, and a reference to that array (y is that reference).
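A tiny standalone demonstration (a minimal sketch, not from the question's code):
import numpy as np

x = np.zeros(3)
y = x            # no copy: y is just another name for the same array
y[0] = 7.0
print(x)         # [7. 0. 0.] -- modifying y changed x too

m = np.zeros((2, 3))
row = m[0, :]    # slicing likewise returns a view into m
row[1] = 5.0
print(m[0])      # [0. 5. 0.]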
positive_only = long_dollars_ch[c_day,:]
...
negative_only = long_dollars_ch[c_day,:]
These lines do not make a copy of long_dollars_ch; they only create references (views) into it.
You need to use the copy method (or one of the other methods NumPy provides for this) to make it work.
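Applied to the question's code, the fix could look like this (a sketch using the question's own names):
positive_only = long_dollars_ch[c_day,:].copy()  # independent copy of the row
negative_only = long_dollars_ch[c_day,:].copy()  # zeroing one no longer touches the other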
Here is the documentation.
EDIT: I posted the wrong link; now it is OK.