Goal
I am writing a "colocalization" script to identify unique co-localized pairs of coordinates between two sets of data. My data is fairly large, with up to ~100k points in each set, so performance is important.
For example, I have two sets of points:
import numpy as np
points_a = np.array([[1, 1],[2, 2],[3, 3],[6, 6]])
points_b = np.array([[1, 1],[2, 3],[3, 5],[6, 6], [7,6]]) # may be longer than points_a
For each point in points_a I want to find the nearest point in points_b. However, I don't want any point in points_b used in more than one pair. I can easily find the nearest neighbors using NearestNeighbors or one of the similar routines:
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, indices = neigh.kneighbors(points_a)
print(indices)
>>> [0, 1, 1, 3]
As above, this can give me a solution where a point in point_b is used twice. I would like to instead find the solution where each point is used once while minimizing the total distance across all pairs. In the above case:
[0, 1, 2, 3]
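For reference, on a toy example this size the brute-force formulation is just a rectangular linear assignment problem: build the dense cost matrix and let scipy.optimize.linear_sum_assignment pick the pairing. This is only a sketch to make the objective explicit; a dense cost matrix obviously won't scale to ~100k points:
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

cost = cdist(points_a, points_b)               # dense (4, 5) distance matrix
row_ind, col_ind = linear_sum_assignment(cost)
print(col_ind)                                 # [0 1 2 3]
print(cost[row_ind, col_ind].sum())            # 3.0, the minimal total distance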
I figure a start would be to use NearestNeighbors or similar to find nearest neighbor candidates:
from sklearn.neighbors import NearestNeighbors

max_search_radius = 3
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, indices = neigh.radius_neighbors(points_a, max_search_radius)
print(distances)
print(indices)
>>> [[0, 2.24], [1.41, 1], [2.83, 1, 2], [0, 1]]
>>> [[0, 1], [0, 1], [0, 1, 2], [3, 4]]
This shrinks down the overall search space, but I am unclear how I can then compute the global optimum. I stumbled across this post: Find optimal unique neighbour pairs based on closest distance,
but that solution is for a single set of points and I am not sure how to translate the method to my case.
Any advice would be greatly appreciated!
Update
Hey all. With everyone's advice I found a somewhat working solution:
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix, csgraph
def colocalize_points(points_a: np.ndarray, points_b: np.ndarray, r: int):
    """ Find pairs that minimize global distance. Filters out anything outside radius `r` """
    neigh = NearestNeighbors(n_neighbors=1)
    neigh.fit(points_b)
    distances, b_indices = neigh.radius_neighbors(points_a, radius=r)

    # flatten and get indices for A. This will also drop points in A with no matches in range.
    # The +1 keeps zero distances as explicit entries, since a zero in the sparse matrix
    # would be read as "no edge".
    d_flat = np.hstack(distances) + 1
    b_flat = np.hstack(b_indices)
    a_flat = np.array([i for i, neighbors in enumerate(distances) for n in neighbors])

    # filter out A points that cannot be matched at all
    sm = csr_matrix((d_flat, (a_flat, b_flat)))
    a_matchable = csgraph.maximum_bipartite_matching(sm, perm_type='column')
    sm_filtered = sm[a_matchable != -1]

    # now run the distance-minimizing matching
    row_match, col_match = csgraph.min_weight_full_bipartite_matching(sm_filtered)
    return row_match, col_match
The only issue is that by filtering the matrix with maximum_bipartite_matching I cannot be sure I truly have the best result, since it just keeps the first match it finds. For example, if I have two points in A, [[2, 2], [3, 3]], whose only candidate match in B is [3, 3], maximum_bipartite_matching will keep whichever appears first. So if [2, 2] appears first in the matrix, [3, 3] will be dropped despite being the better match.
Update 1
To address comments below, here is my reasoning why maximum_bipartite_matching does not give me the desired solution. Consider points:
points_a = np.array([(1, 1), (2, 2), (3, 3)])
points_b = np.array([(1, 1), (2, 2), (3, 5), (2, 3)])
The optimal a,b point pairing that minimizes distance will be:
[(1, 1): (1, 1),
(2, 2): (2, 2),
(3, 3): (2, 3)]
However if I run the following:
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, b_indices = neigh.radius_neighbors(points_a, radius=3)
# flatten and get indices for A. This will also drop points in A with no matches in range
d_flat = np.hstack(distances) + 1
b_flat = np.hstack(b_indices)
a_flat = np.array([i for i, neighbors in enumerate(distances) for n in neighbors])
# filter out A points that cannot be matched
sm = csr_matrix((d_flat, (a_flat, b_flat)))
a_matchable = csgraph.maximum_bipartite_matching(sm, perm_type='column')
print([(points_a[i], points_b[b]) for i, b in enumerate(a_matchable)])
I get solution:
[(1, 1): (1, 1),
(2, 2): (2, 2),
(3, 3): (3, 5)]
Swapping the last two points in points_b will give me the expected solution. This indicates to me that the algorithm is not taking the distance (weight) into account and instead just tries to maximize the number of connections. I could very well have made a mistake though so please let me know.
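One way to avoid the arbitrary pre-filter entirely (not from the original thread, just a sketch of the standard dummy-node trick; the function name and penalty choice are mine) is to pad the sparse cost matrix with one private, high-cost dummy column per A point. Then min_weight_full_bipartite_matching always has a feasible full matching on the A side and takes the weights into account in a single step; A points assigned to their dummy column are simply left unmatched:
import numpy as np
from scipy.sparse import csr_matrix, identity, hstack
from scipy.sparse import csgraph
from sklearn.neighbors import NearestNeighbors

def colocalize_points_padded(points_a, points_b, r):
    """Min-total-distance pairing within radius r, without a separate pre-filter step."""
    neigh = NearestNeighbors(n_neighbors=1).fit(points_b)
    distances, b_indices = neigh.radius_neighbors(points_a, radius=r)

    d_flat = np.hstack(distances) + 1                 # +1 so zero distances aren't read as missing edges
    b_flat = np.hstack(b_indices)
    a_flat = np.repeat(np.arange(len(points_a)), [len(nb) for nb in b_indices])

    n_a, n_b = len(points_a), len(points_b)
    real = csr_matrix((d_flat, (a_flat, b_flat)), shape=(n_a, n_b))

    # One private dummy column per A point, priced above the sum of all real
    # edge weights, so real matches are always preferred over dropping a point.
    penalty = d_flat.sum() + 1 if d_flat.size else 1.0
    padded = hstack([real, identity(n_a, format='csr') * penalty], format='csr')

    rows, cols = csgraph.min_weight_full_bipartite_matching(padded)
    keep = cols < n_b                                 # keep only matches to real B points
    return rows[keep], cols[keep]

On the Update 1 example this gives rows [0, 1, 2] matched to columns [0, 1, 3], i.e. (3, 3) pairs with (2, 3) as expected, and points in A with no candidate within r simply come back unmatched rather than having to be filtered out beforehand.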
Sorry if the title isn't very descriptive, but what I want is the following.
I have a DataArray with coordinates x, y and t. I also have a list of N coordinates and I'd like to interpolate to get a list of N interpolated values. However, I don't quite know how to do that with xarray while still taking advantage of the parallelism of dask. Here's an example with random values:
import numpy as np
import xarray as xr
x = np.linspace(0, 1, 10)
datar = xr.DataArray(np.random.randn(10, 10, 10), dims=('x', 'y', 't'),
                     coords=dict(x=x, y=x, t=x))
datar = datar.chunk(dict(t=1))
points = np.array([(0.1, 0.1, 0.1),
                   (0.2, 0.3, 0.3),
                   (0.6, 0.6, 0.6)])

ivals = []
for point in points:
    x0, y0, t0 = point
    interp_val = datar.interp(x=x0, y=y0, t=t0)
    ivals.append(float(interp_val))
print(ivals)
This gives me the correct result of [-1.7047738779949937, 0.9568015637947849, 0.04437392968785547].
Is there any way to achieve the same result but taking advantage of dask?
If I naively pass lists to the interpolating function I get a 3 cubed matrix instead:
In [35]: x0s, y0s, t0s = points.T
...: print(datar.interp(x=x0s, y=y0s, t=t0s))
...:
<xarray.DataArray (x: 3, y: 3, t: 3)>
dask.array<dask_aware_interpnd, shape=(3, 3, 3), dtype=float64, chunksize=(3, 3, 3), chunktype=numpy.ndarray>
Coordinates:
* x (x) float64 0.1 0.2 0.6
* y (y) float64 0.1 0.3 0.6
* t (t) float64 0.1 0.3 0.6
A bit late, but in order to interpolate the way you want and not get a cube as a result, you should cast your coordinates as xarray DataArrays that share a fictitious dimension, points:
import numpy as np
import xarray as xr
np.random.seed(1234)
x = np.linspace(0, 1, 10)
datar = xr.DataArray(np.random.randn(10, 10, 10), dims=('x', 'y', 't'), coords=dict(x=x, y=x, t=x))
datar = datar.chunk(dict(t=1))
points = np.array([(0.1, 0.1, 0.1),
                   (0.2, 0.3, 0.3),
                   (0.6, 0.6, 0.6)])
x = xr.DataArray(points[:, 0], dims="points")
y = xr.DataArray(points[:, 1], dims="points")
t = xr.DataArray(points[:, 2], dims="points")
datar.interp(x=x, y=y, t=t).values
It gives you the three values you want. Two remarks:
- You should time the two approaches, your for loop and this solution, to check whether xarray really takes advantage of being given multiple points at once in interp.
- The expected values you give depend on your random data, so you should fix the seed beforehand in order to provide a reproducible example ;)
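As a quick sanity check (reusing datar, points, and the x, y, t DataArrays defined above), the vectorized call should agree with the original per-point loop:
loop_vals = np.array([float(datar.interp(x=px, y=py, t=pt)) for px, py, pt in points])
vec_vals = datar.interp(x=x, y=y, t=t).values
print(np.allclose(loop_vals, vec_vals))   # expected: True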
I have some data that I want to fit to a distribution. The data is given by the frequency. What I mean is, I have every event that I have observed and the number of times that I have observed it. So something like:
data = [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)]
where the first number in each tuple is the event I have observed, and the second number is the total observations for that event.
With Scipy, I can fit (for example) a lognormal distribution using a call to scipy.stats.lognorm.fit. However, this routine expects to see a list of all of the observations, not the frequencies. I can fit the distribution like this:
import scipy.stats

temp_data = []
for x in data:
    temp_data += [x[0]] * x[1]
params = scipy.stats.lognorm.fit(temp_data)
but wow, that seems horribly inefficient.
Is there a way to fit a distribution, in SciPy or a similar tool, based on the frequencies? If not, is there a better way to fit the distribution without having to create a potentially giant list of values?
Unfortunately, looking at the source, it seems like the 'materialized' aspect of the data is hardcoded. The function's not that complicated, though, so you could make your own version. TBH if your total N is still manageable I'd probably just do data = np.array(data); expanded_data = np.repeat(data[:,0], data[:,1]) despite the inefficiency, because life is short.
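Spelled out, that expansion plus the fit would look something like this (fixing the location with floc=0 is my own assumption here and can be dropped):
import numpy as np
import scipy.stats

data = np.array([(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)])
# Repeat each observed value by its count: ~5.4k observations for this data set
expanded_data = np.repeat(data[:, 0], data[:, 1])
shape, loc, scale = scipy.stats.lognorm.fit(expanded_data, floc=0)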
Another alternative would be to use pomegranate, which supports passing weights:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import pomegranate as pg
data = [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)]
data = np.array(data)
expanded = np.repeat(data[:,0], data[:,1].astype(int))
scipy_shape, _, scipy_scale = scipy_params = scipy.stats.lognorm.fit(expanded, floc=0)
scipy_sigma, scipy_mu = scipy_shape, np.log(scipy_scale)
pg_dist = pg.LogNormalDistribution(0, 1)
pg_dist.fit(data[:,0], weights=data[:,1])
pg_mu, pg_sigma = pg_dist.parameters
fig = plt.figure()
ax = fig.add_subplot(111)
x = np.linspace(0.1, 10, 100)
ax.plot(data[:,0], data[:, 1] / data[:,1].sum(), label="freq")
ax.plot(x, scipy.stats.lognorm(*scipy_params).pdf(x),
        label=r"scipy: $\mu$ {:1.3f} $\sigma$ {:1.3f}".format(scipy_mu, scipy_sigma), alpha=0.5)
ax.plot(x, pg_dist.probability(x),
        label=r"pomegranate: $\mu$ {:1.3f} $\sigma$ {:1.3f}".format(pg_mu, pg_sigma), linestyle='--', alpha=0.5)
ax.legend(loc='upper right')
fig.savefig("compare.png")
gives me a plot comparing the observed frequencies with the SciPy and pomegranate fits (saved to compare.png; figure omitted).
You can draw a random sample according to your frequency distribution, and fit that:
import numpy as np
import scipy.stats

data = np.array(
    [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)],
    dtype=float,
)
values = data[:, 0]
weights = data[:, 1]

seed = 87
gen = np.random.default_rng(seed)
sample = gen.choice(values, size=500, p=weights / weights.sum())
params = scipy.stats.lognorm.fit(sample)
A very simple question: how can I compute the following efficiently in Python (or Cython)?
There is a list of polygons in 3D, given in the following form:
vertex = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0],[1, 0, 0],[0.5, 0.5, 0.5]], order = 'F').T
polygons = np.array([3, 0, 1, 2, 4, 1, 2, 3 ,4])
i.e. polygons is a 1D array containing entries of the form [N, i1, i2, i3, i4, ...]:
N is the number of vertices in a polygon, followed by the indices of those vertices in the vertex array (in the example above there is one triangle with the 3 vertices [0, 1, 2] and one polygon with the 4 vertices [1, 2, 3, 4]).
I need to compute a list of all edges and, for each edge, which faces contain it.
And I need to do it fast: the number of vertices can be large.
Update
The polygon is closed, i.e. a polygon [4, 0, 1, 5, 7] means that there are 4 vertices and edges are 0-1, 1-5, 5-7, 7-0
Face is in fact a synonym for polygon.
Dunno if this is the fastest option, most probably not, but it works. I think the slowest part is edges.index((v, polygon[i + 1])), where we have to check whether this edge is already in the list. The vertex array is not really needed, since an edge is just a pair of vertex indices. I used face_index as a reference to the polygon index, since you didn't write what a face is.
vertex = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0.5, 0.5, 0.5]]
polygons = [3, 0, 1, 2, 4, 1, 2, 3, 4]

_polygons = polygons
edges = []
faces = []
face_index = 0

while _polygons:
    polygon = _polygons[1:_polygons[0] + 1]
    polygon.append(polygon[0])          # close the polygon
    _polygons = _polygons[_polygons[0] + 1:]
    for i, v in enumerate(polygon[0:-1]):
        if not (v, polygon[i + 1]) in edges:
            edges.append((v, polygon[i + 1]))
            faces.append([face_index, ])
        else:
            faces[edges.index((v, polygon[i + 1]))].append(face_index)
    face_index += 1

edges = list(zip(edges, faces))
print(edges)
>>> [((0, 1), [0]), ((1, 2), [0, 1]), ((2, 0), [0]), ((2, 3), [1]), ((3, 4), [1]), ((4, 1), [1])]
You can make it faster by removing the line polygon.append(polygon[0]) and appending the first vertex of each polygon to its vertex list in polygons manually, which shouldn't be a problem.
I mean, change polygons = [3,0,1,2,4,1,2,3,4] into polygons = [3,0,1,2,0,4,1,2,3,4,1].
PS: Try to use PEP 8. It is a code style guide. It says you should put a space after every comma in iterables, so the code is easier to read.
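Since the slow part is the linear edges.index() lookup, a dictionary keyed by each edge's vertex pair collects the same information in roughly linear time. This is only a sketch of that idea (the helper name is mine, not from the answer); the key is sorted so an edge shared by two polygons matches even if they traverse it in opposite directions, and it expects a plain Python list like the one above:
def edges_to_faces(polygons):
    edge_faces = {}                 # (v_min, v_max) -> list of face indices
    face_index = 0
    i = 0
    while i < len(polygons):
        n = polygons[i]
        poly = polygons[i + 1:i + 1 + n]
        i += n + 1
        for a, b in zip(poly, poly[1:] + poly[:1]):   # closed polygon: last edge wraps around
            key = (a, b) if a < b else (b, a)
            edge_faces.setdefault(key, []).append(face_index)
        face_index += 1
    return edge_faces

print(edges_to_faces([3, 0, 1, 2, 4, 1, 2, 3, 4]))
# {(0, 1): [0], (1, 2): [0, 1], (0, 2): [0], (2, 3): [1], (3, 4): [1], (1, 4): [1]}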
I have two numpy arrays that are OpenCV convex hulls and I want to check for intersection without creating for loops or creating images and performing numpy.bitwise_and on them, both of which are quite slow in Python. The arrays look like this:
[[[x1 y1]]
[[x2 y2]]
[[x3 y3]]
...
[[xn yn]]]
Considering [[x1 y1]] as one single element, I want to perform intersection between two numpy ndarrays. How can I do that? I have found a few questions of similar nature, but I could not figure out the solution to this from there.
You can pass a view of each array, which collapses each row into a single element, to the intersect1d function like this:
import numpy

def multidim_intersect(arr1, arr2):
    arr1_view = arr1.view([('', arr1.dtype)] * arr1.shape[1])
    arr2_view = arr2.view([('', arr2.dtype)] * arr2.shape[1])
    intersected = numpy.intersect1d(arr1_view, arr2_view)
    return intersected.view(arr1.dtype).reshape(-1, arr1.shape[1])
This creates a view of each array, changing each row to a tuple of values. It then performs the intersection, and changes the result back to the original format. Here's an example of using it:
test_arr1 = numpy.array([[0, 2],
                         [1, 3],
                         [4, 5],
                         [0, 2]])

test_arr2 = numpy.array([[1, 2],
                         [0, 2],
                         [3, 1],
                         [1, 3]])

print(multidim_intersect(test_arr1, test_arr2))
This prints:
[[0 2]
[1 3]]
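For the OpenCV-style hulls from the question, which have shape (N, 1, 2), you would drop the singleton axis first; hull1 and hull2 below are hypothetical placeholders for the two hull arrays, and, as above, this finds shared vertices rather than the geometric overlap:
common_points = multidim_intersect(hull1.reshape(-1, 2), hull2.reshape(-1, 2))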
You can use http://pypi.python.org/pypi/Polygon/2.0.4, here is an example:
>>> import Polygon
>>> a = Polygon.Polygon([(0,0),(1,0),(0,1)])
>>> b = Polygon.Polygon([(0.3,0.3), (0.3, 0.6), (0.6, 0.3)])
>>> a & b
Polygon:
<0:Contour: [0:0.60, 0.30] [1:0.30, 0.30] [2:0.30, 0.60]>
To convert the result of cv2.findContours to Polygon point format, you can:
points1 = contours[0].reshape(-1,2)
This will convert the shape from (N, 1, 2) to (N, 2)
Following is a full example:
import Polygon
import cv2
import numpy as np
from scipy.misc import bytescale
y, x = np.ogrid[-2:2:100j, -2:2:100j]
f1 = bytescale(np.exp(-x**2 - y**2), low=0, high=255)
f2 = bytescale(np.exp(-(x+1)**2 - y**2), low=0, high=255)
c1, hierarchy = cv2.findContours((f1 > 120).astype(np.uint8),
                                 cv2.cv.CV_RETR_EXTERNAL,
                                 cv2.CHAIN_APPROX_SIMPLE)
c2, hierarchy = cv2.findContours((f2 > 120).astype(np.uint8),
                                 cv2.cv.CV_RETR_EXTERNAL,
                                 cv2.CHAIN_APPROX_SIMPLE)

points1 = c1[0].reshape(-1, 2)  # convert shape (n, 1, 2) to (n, 2)
points2 = c2[0].reshape(-1, 2)
import pylab as pl
poly1 = pl.Polygon(points1, color="blue", alpha=0.5)
poly2 = pl.Polygon(points2, color="red", alpha=0.5)
pl.figure(figsize=(8,3))
ax = pl.subplot(121)
ax.add_artist(poly1)
ax.add_artist(poly2)
pl.xlim(0, 100)
pl.ylim(0, 100)
a = Polygon.Polygon(points1)
b = Polygon.Polygon(points2)
intersect = a&b # calculate the intersect polygon
poly3 = pl.Polygon(intersect[0], color="green") # intersect[0] are the points of the polygon
ax = pl.subplot(122)
ax.add_artist(poly3)
pl.xlim(0, 100)
pl.ylim(0, 100)
pl.show()
Output: the two contour polygons plotted side by side, with their intersection shown in green (figure omitted).
So this is what I did to get the job done:
import Polygon, numpy
# Here I extracted and combined some contours and created a convex hull from it.
# Now I wanna check whether a contour acquired differently intersects with this hull or not.
for contour in contours:  # the result of cv2.findContours is a list of contours
    contour1 = contour.flatten()
    contour1 = numpy.reshape(contour1, (int(contour1.shape[0] / 2), -1))
    poly1 = Polygon.Polygon(contour1)

    hull = hull.flatten()  # this is the hull constructed previously
    hull = numpy.reshape(hull, (int(hull.shape[0] / 2), -1))
    poly2 = Polygon.Polygon(hull)

    if (poly1 & poly2).area() <= some_max_val:
        some_operations
I had to use a for loop, and altogether it looks a bit tedious, although it gives me the expected results. Any better methods would be greatly appreciated!
Inspired by jiterrace's answer.
I came across this post while working on the Udacity deep learning class (trying to find the overlap between the training and test data).
I am not familiar with "view" and found the syntax a bit hard to understand, probably the same for my friends who think in "tables".
My approach is basically to flatten/reshape each ndarray of shape (N, X, Y) into shape (N, X*Y).
print(train_dataset.shape)
print(test_dataset.shape)
#(200000L, 28L, 28L)
#(10000L, 28L, 28L)
1). INNER JOIN (easier to understand, slow)
import pandas as pd
%%timeit -n 1 -r 1
def multidim_intersect_df(arr1, arr2):
    p1 = pd.DataFrame([r.flatten() for r in arr1]).drop_duplicates()
    p2 = pd.DataFrame([r.flatten() for r in arr2]).drop_duplicates()
    res = p1.merge(p2)
    return res
inters_df = multidim_intersect_df(train_dataset, test_dataset)
print(inters_df.shape)
#(1153, 784)
#1 loop, best of 1: 2min 56s per loop
2). SET INTERSECTION (fast)
%%timeit -n 1 -r 1
def multidim_intersect(arr1, arr2):
    arr1_new = arr1.reshape((-1, arr1.shape[1]*arr1.shape[2]))  # -1 means the row count is inferred from the other dimensions
    arr2_new = arr2.reshape((-1, arr2.shape[1]*arr2.shape[2]))
    intersected = set(map(tuple, arr1_new)).intersection(set(map(tuple, arr2_new)))  # lists are not hashable, so convert rows to tuples
    return list(intersected)  # in shape of (N, 28*28)
inters = multidim_intersect(train_dataset, test_dataset)
print(len(inters))
# 1153
#1 loop, best of 1: 34.6 s per loop