I currently have arrays that look something like this:
[ 5.23324730e-03 1.01221129e-04 5.23324730e-03 ...,]
There are 500 such rows and 64 columns. I would like to compare a row like the one above to other rows in the same format: the first element of one row against the first element of another, and so on.
The idea is to work out how closely they match. Would anyone have any ideas how I might go about this efficiently? I should note that the values may not be identical, but if I could find values that differ by less than a certain threshold, that would be fine.
If anyone is wondering - I'm trying to compare SURF descriptors...
Thanks so much for your help!
You can save it as a numpy array and then calculate the cosine similarity of each row. This can be done efficiently with a single numpy dot product.
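For example, a minimal sketch, assuming the 500x64 descriptors are stacked in a single numpy array (the name descriptors is mine, not the asker's): normalising the rows and taking one matrix product gives every pairwise cosine similarity at once.

import numpy as np

def cosine_similarity_matrix(descriptors):
    # Normalise every row to unit length, then one matrix product yields
    # the full table of pairwise cosine similarities.
    norms = np.linalg.norm(descriptors, axis=1, keepdims=True)
    unit = descriptors / norms
    return unit @ unit.T  # shape (n_rows, n_rows); entry [i, j] compares rows i and j

Rows whose similarity exceeds some chosen threshold can then be treated as matches.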
It depends on your definition of "closely match". One common way would be to calculate the Euclidean distance; a short sketch follows the links below. See:
How can the euclidean distance be calculated with numpy?
or
Distance between numpy arrays, columnwise
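As a hedged sketch (again assuming a 500x64 array named descriptors and a threshold chosen by the asker), broadcasting gives all pairwise Euclidean distances, and a boolean mask then flags rows that differ by less than the threshold:

import numpy as np

def euclidean_matches(descriptors, threshold):
    # Pairwise differences via broadcasting: shape (n, n, 64)
    diff = descriptors[:, None, :] - descriptors[None, :, :]
    # Euclidean distance between every pair of rows: shape (n, n)
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # True where two rows are closer than the chosen threshold
    return dist < threshold

For 500 rows this fits comfortably in memory; for much larger inputs, scipy.spatial.distance.cdist would be the usual choice.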
I am really new to python and data science and I could really do with some help, please.
I have a dataframe with 440 observations and 6 describing variables. I am supposed to do a hierarchical clustering of the data, but ONLY with the help of the numpy and pandas packages; I cannot use scipy or sklearn. So far I have been able to create the distance matrix (a 440x440 numpy array), and I want only two clusters. For the linkage method I would like to use Ward linkage, but the centroid method would also be fine.

How can I create two clusters out of the distance matrix based on the linkage criterion? I thought of something like "find the smallest distance, merge the corresponding row/column pair into one cluster, remove them from the distance matrix, and repeat until only 2 rows/columns are left, which together cover all my original observations".

I know that's not a good description, but as I said, I am really new to this and I am thankful for any advice.
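A minimal sketch of the merge loop described in the question, using numpy only. It uses single linkage as a stand-in for Ward/centroid (the distance update rule would have to change for those), and the function and variable names are mine rather than the asker's:

import numpy as np

def agglomerate(dist, n_clusters=2):
    # dist is the precomputed (n, n) distance matrix; start with singleton clusters
    n = dist.shape[0]
    clusters = [[i] for i in range(n)]
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)               # never merge a cluster with itself
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest distance
        i, j = np.unravel_index(np.argmin(d), d.shape)
        if i > j:
            i, j = j, i
        clusters[i].extend(clusters[j])       # merge cluster j into cluster i
        del clusters[j]
        # single-linkage update: distance to the merged cluster is the minimum
        d[i, :] = np.minimum(d[i, :], d[j, :])
        d[:, i] = d[i, :]
        d[i, i] = np.inf
        d = np.delete(np.delete(d, j, axis=0), j, axis=1)
    return clusters                           # lists of original row indices, one per cluster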
I have been struggling with a problem for a few days now:
There are 17 numpy arrays with values and corresponding latitude and longitude coordinates. Each of them contains 360*600 points, and the points overlap in some regions. What I want in the end is a composite of the data on one regular grid.
With the common scipy.interpolate.griddata function, the problem is that in these overlapping regions I often have different values, which results in strange artefacts, as you can see in the first image:
My first idea is to take the max value of the values used in the interpolation.
I have found out that scipy.interpolate.griddata uses triangulation to interpolate, but I can't find a pipeline that I can adapt.
I hope you can understand that I am not sharing any code, because the dataset is huge and my question is more about finding the best practice or getting some interesting ideas to solve this problem. Thanks in advance for your support.
Maybe first calculate the distance matrix between your regular grid points (x) and the existing irregular ones (y):
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html
Then, for each grid point, find the indices of the k smallest distances and take the maximum of the corresponding values on the irregular grid, as sketched below.
Disclaimer: I don't know how this scales, or what your requirements are regarding performance.
Edit: You might be able to pre-eliminate data-sets for specific regions, to minimise the effort to calculate all the distance matrices.
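A rough sketch of that idea, assuming the regular grid points, the irregular points and their values have already been stacked into arrays (grid_pts, irreg_pts, irreg_vals and k are my names, not from the question):

import numpy as np
from scipy.spatial import distance_matrix

def max_of_k_nearest(grid_pts, irreg_pts, irreg_vals, k=4):
    # All pairwise distances between regular grid points and irregular points
    d = distance_matrix(grid_pts, irreg_pts)        # shape (n_grid, n_irreg)
    # Indices of the k nearest irregular points for each grid point
    nearest = np.argpartition(d, k, axis=1)[:, :k]
    # Take the maximum of those k values, as suggested above
    return irreg_vals[nearest].max(axis=1)

For 17 arrays of 360*600 points the full distance matrix will not fit in memory, so in practice one would process the grid points in chunks or switch to scipy.spatial.cKDTree, which answers the same k-nearest query without materialising the whole matrix.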
I have one large scipy csr_matrix and want to calculate matrix.dot(matrix.T). However, I do not need every single dot product, but rather only those selected by some rules, for example as specified by another binary matrix that has nonzero elements for the rows/columns that should be calculated. For example, if rules[0,10] = 1, then the dot product between row 0 and row 10 should be determined. These rules could of course also be represented by some other data structure.
A simple solution would be to manually loop through the rules, slice the corresponding rows/columns and compute the dot products. This does not seem like the best solution to me, especially as slicing is also quite expensive with sparse matrices. Maybe someone has a better idea about how to approach this.
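For reference, a minimal version of the naive loop described above (assuming rules exposes a nonzero() method, as both sparse and dense matrices do); whether something smarter is possible is exactly the open question:

import numpy as np

def selective_row_dots(matrix, rules):
    # rules.nonzero() yields the (i, j) pairs whose dot products are wanted
    rows, cols = rules.nonzero()
    out = np.empty(len(rows))
    for n, (i, j) in enumerate(zip(rows, cols)):
        # dot product of row i with row j of the sparse matrix
        out[n] = matrix.getrow(i).dot(matrix.getrow(j).T)[0, 0]
    return out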
In case of a 2D array array.cumsum(0).cumsum(1) gives the Integral image of the array.
What happens if I compute array.cumsum(0).cumsum(1).cumsum(2) over a 3D array?
Do I get a 3D extension of the integral image, i.e., an integral volume over the array?
It's hard to visualize what happens in the 3D case.
I have gone through this discussion.
3D variant for summed area table (SAT)
This gives a recursive way to compute the integral volume. What if I use cumsum along the 3 axes? Will it give me the same thing?
Will it be more efficient than the recursive method?
Yes, the formula you give, array.cumsum(0).cumsum(1).cumsum(2), will work.
What the formula does is compute a few partial sums so that the sum of these sums is the volume sum. That is, every element needs to be summed exactly once, or, in other words, no element can be skipped and no element counted twice. I think going through each of these questions (is any element skipped or counted twice) is a good way to verify to yourself that this will work. And also run a small test:
import numpy as np

x = np.ones((20, 20, 20)).cumsum(0).cumsum(1).cumsum(2)
print(x[2, 6, 10])  # 231.0
print(3 * 7 * 11)   # 231
Of course, with all ones there could be two errors that cancel each other out, but that wouldn't happen everywhere, so it's a reasonable test.
As for efficiency, I would guess that the single pass approach is probably faster, but not by a lot. Also, the above could be sped up by using an output array, e.g. cumsum(axis, out=temp), since otherwise three temporary arrays are created for this calculation. The best way to know is to test (but only if you need to).
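To illustrate the out= remark, a small sketch; it assumes reusing a single buffer for the later passes is acceptable (cumsum accumulates in order, so writing into its own input is fine):

import numpy as np

x = np.random.rand(20, 20, 20)
vol = np.cumsum(x, axis=0)         # first pass allocates one new array
np.cumsum(vol, axis=1, out=vol)    # second pass reuses that buffer
np.cumsum(vol, axis=2, out=vol)    # third pass reuses it again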
I have two arrays, array A with ~1M lines and array B with ~400K lines. Each contains, among other things, the coordinates of a point. For each point in array A, I need to find how many points in array B are within a certain distance of it. How do I avoid naively comparing everything to everything? Based on its speed at the start, running naively would take 10+ days on my machine. The naive approach requires nested loops, and the arrays are too large to construct a distance matrix (400G entries!).
I thought the way would be to check only a limited set of B coordinates against each A coordinate, but I haven't found an easy way of doing that. That is, what's the easiest/quickest way to make that selection without checking all the values in B (which is exactly the task I'm trying to avoid)?
EDIT: I should have mentioned these aren't 2D (or nD) Cartesian coordinates, but points on a spherical surface (lat/long), and the distance is the great-circle distance.
I cannot give a full answer right now, but here are some hints to get you started. It will be much more efficient to organise the points in B in a k-d tree. You can use the class scipy.spatial.KDTree to do this easily, and its query_ball_point() method to request all points within a given distance.
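A sketch along those lines, adapted for the lat/long edit by converting to 3-D unit vectors so that chord length stands in for great-circle distance (the function names and the degree-based radius are mine):

import numpy as np
from scipy.spatial import cKDTree

def to_unit_vectors(lat_deg, lon_deg):
    # Convert lat/long in degrees to points on the unit sphere
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

def count_neighbours(lat_a, lon_a, lat_b, lon_b, max_angle_deg):
    xyz_a = to_unit_vectors(lat_a, lon_a)
    xyz_b = to_unit_vectors(lat_b, lon_b)
    # A great-circle angle theta corresponds to a chord of length 2*sin(theta/2)
    chord = 2.0 * np.sin(np.radians(max_angle_deg) / 2.0)
    tree = cKDTree(xyz_b)
    # For each A point, the indices of all B points within the chord radius
    neighbours = tree.query_ball_point(xyz_a, chord)
    return np.array([len(idx) for idx in neighbours])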
Here is one possible implementation of a cross-match between lists of points on the sphere using a k-d tree:
http://code.google.com/p/astrolibpy/source/browse/my_utils/match_lists.py
Another way is to use the healpy module and its get_neighbors method.