System
OS: Windows 10 (x64), Build 1909
Python Version: 3.8.10
Numpy Version: 1.21.2
Question
Given two 2D (N, 3) Numpy arrays of (x, y, z) floating-point data points, what is the Pythonic (vectorized) way to find the indices in one array where points are equal to the points in the other array?
(NOTE: My question differs in that I need this to work with real-world data sets where the two data sets may differ by floating point error. Please read on below for details.)
History
Very similar questions have been asked many times:
how to find indices of a 2d numpy array occuring in another 2d array
test for membership in a 2d numpy array
Get indices of intersecting rows of Numpy 2d Array
Find indices of rows of numpy 2d array in another 2D array
Indices of intersecting rows of Numpy 2d Array
Previous Attempts
SO Post 1 provides a working list comprehension solution, but I am looking for a solution that will scale well to large data sets (i.e. millions of points):
Code 1:
import numpy as np
if __name__ == "__main__":
    big_array = np.array(
        [
            [1.0, 2.0, 1.2],
            [5.0, 3.0, 0.12],
            [-1.0, 14.0, 0.0],
            [-9.0, 0.0, 13.0],
        ]
    )
    small_array = np.array(
        [
            [5.0, 3.0, 0.12],
            [-9.0, 0.0, 13.0],
        ]
    )
    inds = [
        ndx
        for ndx, barr in enumerate(big_array)
        for sarr in small_array
        if all(sarr == barr)
    ]
    print(inds)
Output 1:
[1, 3]
Attempting the solution of SO Post 3 (similar to SO Post 2) with floats does not work (I suspect something using np.isclose will be needed):
Code 3:
import numpy as np
if __name__ == "__main__":
    big_array = np.array(
        [
            [1.0, 2.0, 1.2],
            [5.0, 3.0, 0.12],
            [-1.0, 14.0, 0.0],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    small_array = np.array(
        [
            [5.0, 3.0, 0.12],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    inds = np.nonzero(
        np.in1d(big_array.view("f,f").reshape(-1), small_array.view("f,f").reshape(-1))
    )[0]
    print(inds)
Output 3:
[ 3 4 5 8 9 10 11]
My Attempt
I have tried numpy.isin with np.all and np.argwhere
inds = np.argwhere(np.all(np.isin(big_array, small_array), axis=1)).reshape(-1)
which works (and is, I argue, much more readable and understandable, i.e. Pythonic), but will not work for real-world data sets containing floating-point errors:
import numpy as np
if __name__ == "__main__":
    big_array = np.array(
        [
            [1.0, 2.0, 1.2],
            [5.0, 3.0, 0.12],
            [-1.0, 14.0, 0.0],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    small_array = np.array(
        [
            [5.0, 3.0, 0.12],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    small_array_fpe = np.array(
        [
            [5.0 + 1e-9, 3.0 + 1e-9, 0.12 + 1e-9],
            [-9.0 + 1e-9, 0.0 + 1e-9, 13.0 + 1e-9],
        ],
        dtype=float,
    )
    inds_no_fpe = np.argwhere(np.all(np.isin(big_array, small_array), axis=1)).reshape(-1)
    inds_with_fpe = np.argwhere(
        np.all(np.isin(big_array, small_array_fpe), axis=1)
    ).reshape(-1)
    print(f"No Floating Point Error: {inds_no_fpe}")
    print(f"With Floating Point Error: {inds_with_fpe}")
    print(f"Are 5.0 and 5.0+1e-9 close?: {np.isclose(5.0, 5.0 + 1e-9)}")
Output:
No Floating Point Error: [1 3]
With Floating Point Error: []
Are 5.0 and 5.0+1e-9 close?: True
How can I make my above solution work (on data sets with floating point error) by incorporating np.isclose? Alternative solutions are welcome.
NOTE: Since small_array is a subset of big_array, using np.isclose directly doesn't work because the shapes won't broadcast:
np.isclose(big_array, small_array_fpe)
yields
ValueError: operands could not be broadcast together with shapes (4,3) (2,3)
Update
Currently, the only working solution I have is
inds_with_fpe = [
    ndx
    for ndx, barr in enumerate(big_array)
    for sarr in small_array_fpe
    if np.all(np.isclose(sarr, barr))
]
As @Michael Anderson already mentioned, this can be implemented using a k-d tree. In comparison to your answer, this solution uses an absolute error (a Euclidean distance threshold). Whether this is acceptable depends on the problem.
Example
import numpy as np
from scipy import spatial
def find_nearest(big_array, small_array, tolerance):
    tree_big = spatial.cKDTree(big_array)
    tree_small = spatial.cKDTree(small_array)
    return tree_small.query_ball_tree(tree_big, r=tolerance)
Timings
big_array=np.random.rand(100_000,3)
small_array=np.random.rand(1_000,3)
big_array[1000:2000]=small_array
%timeit find_nearest(big_array,small_array,1e-9) #find all pairs within a distance of 1e-9
#55.7 ms ± 830 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#A. Hendry
%timeit np.argwhere(np.isclose(small_array, big_array[:, None, :]).all(-1).any(-1)).reshape(-1)
#3.24 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
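query_ball_tree returns one list per row of small_array, each holding indices into big_array, so it still has to be flattened to get the index array the question asks for. A minimal sketch of that step, assuming the 4-point example data from the question; the helper name matching_indices is made up for illustration:

```python
import numpy as np
from scipy import spatial


def matching_indices(big_array, small_array, tolerance):
    """Indices of big_array rows within `tolerance` (Euclidean) of any small_array row."""
    tree_big = spatial.cKDTree(big_array)
    tree_small = spatial.cKDTree(small_array)
    # One list per row of small_array, holding indices into big_array.
    neighbours = tree_small.query_ball_tree(tree_big, r=tolerance)
    hits = [i for found in neighbours for i in found]
    return np.unique(np.asarray(hits, dtype=np.intp))


big_array = np.array(
    [[1.0, 2.0, 1.2], [5.0, 3.0, 0.12], [-1.0, 14.0, 0.0], [-9.0, 0.0, 13.0]]
)
small_array_fpe = np.array([[5.0, 3.0, 0.12], [-9.0, 0.0, 13.0]]) + 1e-9

inds = matching_indices(big_array, small_array_fpe, tolerance=1e-8)
print(inds)  # [1 3]
```

Note that the tolerance here is a distance between whole points, whereas np.isclose compares each component separately with atol + rtol * abs(b), so the two criteria are close but not identical.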
I'm not going to give any code, but I've dealt with problems similar to this on a large scale. I suspect that to get decent performance with either of these approaches you'll need to implement the core in C (you might get away with using numba).
If both your arrays are huge there are a few approaches that can work.
Primarily these boil down to building a structure that can be used to find the nearest neighbor of a point from one of the arrays, and then querying it for each point in the other data set.
To do this I've previously used a Kd Tree approach, and a grid based approach.
The basis of the grid based approach is
find the 3D extents of your first array.
split this region into L*N*M bins.
For each input point in the second array, find its bin. Any point that matches it will be in that bin.
The edge cases you need to handle are
if the point falls on the edge of a bin, or close enough to the boundary of a bin that points considered equal to it might fall in the other bin - then you need to search more than one bin for its "equal".
if the point falls outside all the bins, but close to the edge, points "equal" to it might fall in a nearby bin.
The downside is that this is bad for data that is not uniformly distributed.
The upside is that it is relatively simple. Expected run time for uniform data is n1 * n2 / (L*N*M) (compared to n1*n2). Typically you select L,N,M such that this becomes O(n log(n)). You also get some further uplift from sorting the second array to improve reuse of the bins. It is also relatively easy to parallelize (both the binning and searching)
The K-d Tree approach is similar. IIRC it gives O(n log(n)) behavior, but it is trickier to implement, and the building of the data structure is tricky to parallelize. It tends not to be as cache friendly, which can mean that although its asymptotic run time is better than the grid based approach, it can run slower in practice. However, it does give better guarantees for non-uniformly distributed data.
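To make the binning idea concrete, here is a rough pure-Python sketch. It is a hashed-grid variant (cells of edge length tol stored in a dict) rather than the fixed L*N*M binning over the data extents described above, and it is meant to show the logic, not to be fast. The function name grid_match and the per-component absolute tolerance are assumptions made for the example:

```python
from collections import defaultdict
from itertools import product

import numpy as np


def grid_match(big_array, small_array, tol):
    """Indices of big_array rows within `tol` (per component) of some small_array row."""
    # Bucket the small_array points by the integer cell (edge length tol) containing them.
    cells = defaultdict(list)
    for j, key in enumerate(np.floor(small_array / tol).astype(np.int64).tolist()):
        cells[tuple(key)].append(j)

    # A match may sit in the same cell or in any adjacent one (the bin-edge case),
    # ignoring rounding effects at the very limit of the tolerance.
    offsets = list(product((-1, 0, 1), repeat=big_array.shape[1]))
    matches = []
    for i, key in enumerate(np.floor(big_array / tol).astype(np.int64).tolist()):
        for off in offsets:
            candidates = cells.get(tuple(k + d for k, d in zip(key, off)))
            if candidates and np.any(
                np.all(np.abs(small_array[candidates] - big_array[i]) <= tol, axis=1)
            ):
                matches.append(i)
                break
    return np.array(matches, dtype=np.intp)


big_array = np.array(
    [[1.0, 2.0, 1.2], [5.0, 3.0, 0.12], [-1.0, 14.0, 0.0], [-9.0, 0.0, 13.0]]
)
small_array_fpe = np.array([[5.0, 3.0, 0.12], [-9.0, 0.0, 13.0]]) + 1e-9
print(grid_match(big_array, small_array_fpe, tol=1e-8))  # [1 3]
```

Only the points sharing a neighbourhood of cells are ever compared, which is where the saving over the all-pairs comparison comes from. Very large coordinates combined with a tiny tol could overflow the int64 cell index, so a real implementation would need to guard against that.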
Credit to @AndrasDeak for this answer
The following code snippet
inds_with_fpe = np.argwhere(
    np.isclose(small_array_fpe, big_array[:, None, :]).all(-1).any(-1)
).reshape(-1)
will make the code work. The corresponding output is now:
No Floating Point Error: [1 3]
With Floating Point Error: [1 3]
Are 5.0 and 5.0+1e-9 close?: True
None in the above creates a new axis (same as np.newaxis). This changes the shape of big_array to (4, 1, 3), which adheres to the broadcasting rules and permits np.isclose to run. That is, big_array is now a set of 4 1 x 3 points, and since its middle axis has length 1, small_array_fpe (shape (2, 3), treated as (1, 2, 3)) can be broadcast against it so that the elements can be compared element-wise.
The result is a (4, 2, 3) boolean array; every point of big_array is compared component-wise to every point of small_array_fpe, and the array records which components are close (within a specific tolerance). Since all is called as an array method rather than as a numpy function, its first argument is the axis rather than the input array. Hence, -1 in the calls above means "the last axis of the array".
We first reduce the (4, 2, 3) array with all over its last axis (i.e. all (x, y, z) components are close), which yields a 4 x 2 array. any over the last axis of that array then marks each row of big_array that matches at least one point of small_array_fpe, yielding a boolean array of shape (4,).
argwhere returns indices grouped by element, so its shape is normally (number nonzero items, num dims of input array), hence we flatten it into a 1d array with reshape(-1).
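The chain of shapes can be checked step by step. The snippet below is just the one-liner above split into its intermediate results, using the example data from the question:

```python
import numpy as np

big_array = np.array(
    [[1.0, 2.0, 1.2], [5.0, 3.0, 0.12], [-1.0, 14.0, 0.0], [-9.0, 0.0, 13.0]]
)
small_array_fpe = np.array([[5.0, 3.0, 0.12], [-9.0, 0.0, 13.0]]) + 1e-9

# (2, 3) against (4, 1, 3) broadcasts to (4, 2, 3): every big row vs. every small row.
close = np.isclose(small_array_fpe, big_array[:, None, :])
print(close.shape)        # (4, 2, 3)

rows_equal = close.all(-1)    # all three components close
print(rows_equal.shape)   # (4, 2)

matched = rows_equal.any(-1)  # big row matches at least one small row
print(matched.shape)      # (4,)

inds = np.argwhere(matched).reshape(-1)
print(inds)               # [1 3]
```

For a 1D boolean array, np.flatnonzero(matched) gives the same indices without the reshape.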
Unfortunately, this requires an amount of memory that grows with the product of the two array sizes, since every element of big_array is checked against every element of small_array_fpe. For example, searching for 10,000 points in a set of another 10,000 points with 32-bit floating-point data needs, for a single broadcast temporary,
Memory = 10000 * 10000 * 3 * 4 bytes = 1.2e9 bytes (about 1.1 GiB),
and np.isclose also has to hold the boolean result of the same shape.
If anyone can devise a solution with a faster run time and reasonable amount of memory, that would be fantastic!
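One way to keep the same np.isclose semantics while bounding memory is to process big_array in blocks. This is only a sketch: the function name close_rows_chunked and the default chunk size are arbitrary choices, and the run time is still proportional to the product of the two array sizes.

```python
import numpy as np


def close_rows_chunked(big_array, small_array, chunk=1024, **isclose_kwargs):
    """Indices of big_array rows that are np.isclose to some row of small_array."""
    matched = np.zeros(len(big_array), dtype=bool)
    for start in range(0, len(big_array), chunk):
        block = big_array[start:start + chunk]
        # Temporaries are at most (chunk, len(small_array), 3) instead of the full product.
        close = np.isclose(small_array, block[:, None, :], **isclose_kwargs)
        matched[start:start + chunk] = close.all(-1).any(-1)
    return np.flatnonzero(matched)


big_array = np.array(
    [[1.0, 2.0, 1.2], [5.0, 3.0, 0.12], [-1.0, 14.0, 0.0], [-9.0, 0.0, 13.0]]
)
small_array_fpe = np.array([[5.0, 3.0, 0.12], [-9.0, 0.0, 13.0]]) + 1e-9
print(close_rows_chunked(big_array, small_array_fpe, chunk=3))  # [1 3]
```

Extra keyword arguments such as rtol and atol are passed straight through to np.isclose.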
FYI:
from timeit import timeit
import numpy as np
big_array = np.array(
    [
        [1.0, 2.0, 1.2],
        [5.0, 3.0, 0.12],
        [-1.0, 14.0, 0.0],
        [-9.0, 0.0, 13.0],
    ],
    dtype=float,
)
small_array = np.array(
    [
        [5.0 + 1e-9, 3.0 + 1e-9, 0.12 + 1e-9],
        [10.0, 2.0, 5.8],
        [-9.0 + 1e-9, 0.0 + 1e-9, 13.0 + 1e-9],
    ],
    dtype=float,
)


def approach01():
    return [
        ndx
        for ndx, barr in enumerate(big_array)
        for sarr in small_array
        if np.all(np.isclose(sarr, barr))
    ]


def approach02():
    return np.argwhere(
        np.isclose(small_array, big_array[:, None, :]).all(-1).any(-1)
    ).reshape(-1)


if __name__ == "__main__":
    time01 = timeit(
        "approach01()",
        number=10000,
        setup="import numpy as np; from __main__ import approach01",
    )
    time02 = timeit(
        "approach02()",
        number=10000,
        setup="import numpy as np; from __main__ import approach02",
    )
    print(f"Approach 1 (List Comprehension): {time01}")
    print(f"Approach 2 (Vectorized): {time02}")
Output:
Approach 1 (List Comprehension): 8.1180582
Approach 2 (Vectorized): 0.9656997
Is there a simple python 3 command that replicates matlab's interp1 command over multiple columns?
data_1 contains two parameters (1 per column) that go with the time_1 time vector (data_1 is a 5 by 2 array that isn't actually used in this example so can be ignored)
data_2 contains two parameters (1 per column) that go with the time_2 time vector
import numpy as np
data_2 = np.array([ [ 0.43, -0.54], [ 0.32, -0.83], [ 0.26, -0.94], [ 0.51, -0.69], [ 0.63, -0.74] ])
time_1 = np.array([ 399.87, 399.89, 399.91, 399.93, 399.95 ])
time_2 = np.array([ 399.86, 399.88, 399.90, 399.92, 399.94 ])
I'd like to interpolate the data_2 2D array into the time_1 time vector so both data sets will have the same time vector.
Desired output (which is just the np.interp of the two data_2 columns into the time_1 time vector, merged back into an array) is:
data_2_i = np.array([[ 0.375, -0.685], [ 0.290, -0.885], [ 0.385, -0.815], [ 0.570, -0.715], [ 0.630, -0.740]])
Actual arrays will contain approx 20 columns (parameters) and thousands of rows (longer time range).
I know you can just loop over each column with np.interp but I was hoping there was a more compact and faster python 3 (numpy, scipy, pandas, etc.) method that I haven't been able to track down yet. I'm still pretty new to python (more familiar with matlab).
In matlab, you can just use interp1 on the entire multi-column array to get the multi-column result (although the edge cases are handled a bit differently - NaNs vs. last entry in this example - I'm not worried about the edge case differences here).
This looks to work (just made a quick script myself):
import numpy as np
def interp_multi(x_i, x, y):
    ncol = y.shape[1]
    y_i = np.zeros((len(x_i), ncol))
    for i in range(ncol):
        y_i[:, i] = np.interp(x_i, x, y[:, i])
    return y_i
data_2_i = interp_multi(time_1, time_2, data_2)
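If the column loop ever becomes a bottleneck, the same linear interpolation can be done for all columns at once with np.searchsorted. This is a sketch under the assumptions that x is strictly increasing with at least two samples and that values outside the range should be clamped to the end points, as np.interp does; the name interp_columns is made up for the example:

```python
import numpy as np


def interp_columns(x_new, x, y):
    """Linearly interpolate every column of y (sampled at x) onto x_new."""
    x_new = np.asarray(x_new, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Index of the right-hand neighbour of each new sample, kept inside the table.
    hi = np.clip(np.searchsorted(x, x_new), 1, len(x) - 1)
    lo = hi - 1
    # Fractional position inside the interval; clipping clamps out-of-range samples.
    t = np.clip((x_new - x[lo]) / (x[hi] - x[lo]), 0.0, 1.0)
    return y[lo] + t[:, None] * (y[hi] - y[lo])


data_2 = np.array(
    [[0.43, -0.54], [0.32, -0.83], [0.26, -0.94], [0.51, -0.69], [0.63, -0.74]]
)
time_1 = np.array([399.87, 399.89, 399.91, 399.93, 399.95])
time_2 = np.array([399.86, 399.88, 399.90, 399.92, 399.94])

data_2_i = interp_columns(time_1, time_2, data_2)
print(np.round(data_2_i, 3))
```

With roughly 20 columns the loop over np.interp is usually fine, so it is worth timing both on real data before switching.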
Let's say that there's a "master" array of times with these values:
master = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0]
I want to find the most "compatible" array among several candidates:
candidates = [
    [0.01, 0.48, 1.03, 1.17, 1.5],
    [1.25, 1.4, 1.5, 1.9, 2.0],
    ...
]
In this case I consider the first candidate most compatible because after adding 1 to each value, 4 of the values are very close to values that exist in master (the 2nd candidate only has 3 values that match master), and order matters (though we can say the arrays are already sorted with no duplicate values, since they represent times).
A physical example could be that master is an array of beat onsets for a clean recording of an audio track, while the candidates are arrays of beat onsets for various audio recordings that may or may not be of the same audio track. I'd like to find the candidate that is most likely to be a recording of (at least a portion of) the same audio track.
I'm not sure of an algorithm to choose among these candidates. I've done some searching that led me to topics like cross-correlation, string distance, and fuzzy matching, but I'd like to know if I'm missing the forest for the trees here. I'm most familiar with data analysis in NumPy and Pandas, so I will tag the question as such.
One way would be to create those sliding 1D arrays as a stacked 2D array with broadcasting and then get the distances against the 2D array with Scipy's cdist. Finally, we get the minimum distance along each row and choose the row with minimum of such distances. Thus, we would have an implementation like so -
import numpy as np
from scipy.spatial.distance import cdist
# a: 2D array with one candidate per row, b: longer 1D array to slide along
Na = a.shape[1]
Nb = b.size
b2D = b[np.arange(Nb-Na+1)[:,None] + np.arange(Na)]
closesetID = cdist(a,b2D).min(1).argmin()
Sample run -
In [170]: a = np.random.randint(0,99,(400,500))
In [171]: b = np.random.randint(0,99,(700))
In [172]: b[100:100+a.shape[1]] = a[77,:] + np.random.randn(a.shape[1])
# Make b starting at 100th col same as 77th row from 'a' with added noise
In [173]: Na = a.shape[1]
...: Nb = b.size
...: b2D = b[np.arange(Nb-Na+1)[:,None] + np.arange(Na)]
...: closesetID = cdist(a,b2D).min(1).argmin()
...:
In [174]: closesetID
Out[174]: 77
Note: To me, using the default option of cdist, which is the Euclidean distance, made sense for such a problem. The docs list numerous other distance metrics that could replace the default one.
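The window construction is easier to see on a tiny example. The sketch below uses made-up numbers and plain NumPy for the Euclidean distance (the quantity cdist computes by default), and also recovers the offset at which the best candidate lines up:

```python
import numpy as np

b = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])  # long 1D reference array
a = np.array([[0.0, 5.0, 9.0],                       # candidate rows, all of length 3
              [12.1, 12.9, 14.2]])

Na = a.shape[1]
Nb = b.size
# Row k of `windows` is b[k:k + Na]; the index array has shape (Nb - Na + 1, Na).
windows = b[np.arange(Nb - Na + 1)[:, None] + np.arange(Na)]
print(windows.shape)  # (4, 3)

# Distance between every candidate row and every window, shape (2, 4).
dists = np.linalg.norm(a[:, None, :] - windows[None, :, :], axis=-1)
best_candidate = dists.min(axis=1).argmin()
best_offset = dists[best_candidate].argmin()
print(best_candidate, best_offset)  # 1 2
```

This slides along array positions, so it assumes all candidates have the same length and that a match occupies consecutive entries of b.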
Let's say I have a standard 2d numpy array, let's call it my2darray with values. In this array there are two major sections. Let's say for each column, there is a specific row which separates "scenario1" and "scenario2". How can I create 2 masked arrays that represent the top section of my2darray and the bottom of my2darray? For example, I am interested in calculating the mean of the top half and the mean of the second half. One idea is to have a mask that is of the same shape as my2darray but that seems like a waste of memory. Is there a better idea? Let's say I have a vector, in which the length is equal to the number of rows in my2darray (in this case 6), i.e. I have
myvector=np.array([9, 15, 5,7,11,11])
I am using python 2.6 with numpy 1.5.0
Using NumPy's broadcasted comparison, we can create such a 2D mask in a vectorized manner. Rest of the work is all about sum-reduction along the first axis for which we can take help from np.einsum. Thus, we would have an implementation like so -
N = my2darray.shape[0]
mask = myvector <= np.arange(N)[:,None]
uout = np.true_divide(np.einsum('ij,ij->j',my2darray,~mask),myvector)
lout = np.true_divide(np.einsum('ij,ij->j',my2darray,mask),N-myvector)
Sample run to verify results -
In [184]: N = my2darray.shape[0]
...: mask = myvector <= np.arange(N)[:,None]
...: uout = np.true_divide(np.einsum('ij,ij->j',my2darray,~mask),myvector)
...: lout = np.true_divide(np.einsum('ij,ij->j',my2darray,mask),N-myvector)
...:
In [185]: uout
Out[185]: array([ 6. , 4.6, 4. , 0. ])
In [186]: [my2darray[:item,i].mean() for i,item in enumerate(myvector)]
Out[186]: [6.0, 4.5999999999999996, 4.0, 0.0] # Loopy version results
In [187]: lout
Out[187]: array([ 5.2 , 4. , 2.66666667, 2. ])
In [188]: [my2darray[item:,i].mean() for i,item in enumerate(myvector)]
Out[188]: [5.2000000000000002, 4.0, 2.6666666666666665, 2.0] # Loopy version
Another potentially faster way would be to calculate the summations for the upper mask, store it and from it, subtract the sum along the first axis along the entire length of the 2D input array. This could be then used for the calculation of the lower part average. Thus, after we store N and calculate mask, we would have -
usums = np.einsum('ij,ij->j',my2darray,~mask)
uout = np.true_divide(usums,myvector)
lout = np.true_divide(my2darray.sum(0) - usums,N-myvector)
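For reference, here is the second variant as a complete script on a small made-up array, where myvector holds one split row per column (the values are hypothetical). The mask is cast to float before einsum purely to keep both operands the same dtype:

```python
import numpy as np

my2darray = np.array(
    [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0],
     [10.0, 11.0, 12.0]]
)
myvector = np.array([1, 2, 3])  # split row for each column (hypothetical values)

N = my2darray.shape[0]
# mask[i, j] is True when row i belongs to the lower section of column j.
mask = myvector <= np.arange(N)[:, None]

usums = np.einsum('ij,ij->j', my2darray, (~mask).astype(float))
uout = usums / myvector                              # mean of rows [0, split)
lout = (my2darray.sum(0) - usums) / (N - myvector)   # mean of rows [split, N)
print(uout, lout)
```

A split value of 0 or N leaves one section empty and makes the corresponding division 0/0, so those columns would need separate handling.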
I have a 1-dimensional numpy array scores of scores associated with some objects. These objects belong to some disjoint groups, and all the scores of the items in the first group are first, followed by the scores of the items in the second group, etc.
I'd like to create a 2-dimensional array where each row corresponds to a group, and each entry is the score of one of its items. If all the groups are of the same size I can just do:
scores.reshape((numGroups, groupSize))
Unfortunately, my groups may be of varying size. I understand that numpy doesn't support ragged arrays, but it is fine for me if the resulting array simply pads each row with a specified value to make all rows the same length.
To make this concrete, suppose I have set A with 3 items, B with 2 items, and C with four items.
scores = numpy.array([f(a[0]), f(a[1]), f(a[2]), f(b[0]), f(b[1]),
f(c[0]), f(c[1]), f(c[2]), f(c[3])])
rowStarts = numpy.array([0, 3, 5])
paddingValue = -1.0
scoresByGroup = groupIntoRows(scores, rowStarts, paddingValue)
The desired value of scoresByGroup would be:
[[f(a[0]), f(a[1]), f(a[2]), -1.0],
 [f(b[0]), f(b[1]), -1.0, -1.0],
 [f(c[0]), f(c[1]), f(c[2]), f(c[3])]]
Is there some numpy function or composition of functions I can use to create groupIntoRows?
Background:
This operation will be used in calculating the loss for a minibatch for a gradient descent algorithm in Theano, so that's why I need to keep it as a composition of numpy functions if possible, rather than falling back on native Python.
It's fine to assume there is some known maximum row size
The original objects being scored are vectors and the scoring function is a matrix multiplication, which is why we flatten things out in the first place. It would be possible to pad everything to the maximum item set size before doing the matrix multiplication, but the biggest set is over ten times bigger than the average set size, so this is undesirable for speed reasons.
Try this:
scores = np.random.rand(9)
row_starts = np.array([0, 3, 5])
row_ends = np.concatenate((row_starts, [len(scores)]))
lens = np.diff(row_ends)
pad_len = np.max(lens) - lens
where_to_pad = np.repeat(row_ends[1:], pad_len)
padding_value = -1.0
padded_scores = np.insert(scores, where_to_pad,
padding_value).reshape(-1, np.max(lens))
>>> padded_scores
array([[ 0.05878244, 0.40804443, 0.35640463, -1. ],
[ 0.39365072, 0.85313545, -1. , -1. ],
[ 0.133687 , 0.73651147, 0.98531828, 0.78940163]])
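An alternative that avoids np.insert is to build a boolean mask of the valid positions and assign the scores through it. The sketch below wraps this in a function with the signature asked for in the question; whether the boolean-mask assignment translates to Theano is not something I have checked:

```python
import numpy as np


def group_into_rows(scores, row_starts, padding_value):
    """Pack consecutive groups of `scores` into rows, right-padded with `padding_value`."""
    lens = np.diff(np.append(row_starts, len(scores)))
    # True where a row holds a real entry: column index < group length.
    valid = np.arange(lens.max()) < lens[:, None]
    out = np.full(valid.shape, padding_value, dtype=float)
    out[valid] = scores  # boolean assignment fills the True slots row by row, in order
    return out


scores = np.arange(9, dtype=float)
padded_scores = group_into_rows(scores, np.array([0, 3, 5]), -1.0)
print(padded_scores)
```

For the group sizes 3, 2 and 4 this gives the rows [0, 1, 2, -1], [3, 4, -1, -1] and [5, 6, 7, 8].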