I want to use the Hungarian assignment algorithm in python on a non-square numpy array.
My input matrix X looks like this:
X = np.array([[0.26, 0.64, 0.16, 0.46, 0.5 , 0.63, 0.29],
[0.49, 0.12, 0.61, 0.28, 0.74, 0.54, 0.25],
[0.22, 0.44, 0.25, 0.76, 0.28, 0.49, 0.89],
[0.56, 0.13, 0.45, 0.6 , 0.53, 0.56, 0.05],
[0.66, 0.24, 0.61, 0.21, 0.47, 0.31, 0.35],
[0.4 , 0.85, 0.45, 0.14, 0.26, 0.29, 0.24]])
The desired result is the matrix reordered so that X becomes X_desired_output:
X_desired_output = np.array([[0.63, 0.5 , 0.29, 0.46, 0.26, 0.64, 0.16],
[0.54, 0.74, 0.25, 0.28, 0.49, 0.12, 0.61],
[0.49, 0.28, 0.89, 0.76, 0.22, 0.44, 0.25],
[0.56, 0.53, 0.05, 0.6 , 0.56, 0.13, 0.45],
[0.31, 0.47, 0.35, 0.21, 0.66, 0.24, 0.61],
[0.29, 0.26, 0.24, 0.14, 0.4 , 0.85, 0.45]])
Here I would like to maximize the cost rather than minimize it, so in theory the input to the algorithm would be either 1-X or simply -X.
I found https://software.clapper.org/munkres/, which leads to:
from munkres import Munkres
m = Munkres()
indices = m.compute(-X)
indices
[(0, 5), (1, 4), (2, 6), (3, 3), (4, 0), (5, 1)]
# getting the indices in list format
ii = [i for (i,j) in indices]
jj = [j for (i,j) in indices]
How can I use these to sort X? jj only contains 6 elements, as opposed to the original 7 columns of X.
I am looking to actually get the matrix sorted.
After spending some hours working on it, I found a solution. The problem is that since X.shape[1] > X.shape[0], some columns are never assigned at all.
The documentation states that
"The Munkres algorithm assumes that the cost matrix is square.
However, it’s possible to use a rectangular matrix if you first pad it
with 0 values to make it square. This module automatically pads
rectangular cost matrices to make them square."
from munkres import Munkres
m = Munkres()
indices = m.compute(-X)
indices
[(0, 5), (1, 4), (2, 6), (3, 3), (4, 0), (5, 1)]
# getting the indices in list format
ii = [i for (i,j) in indices]
jj = [j for (i,j) in indices]
# re-order matrix
X_ = X[:, jj]  # re-order columns
X_ = X_[ii, :]  # re-order rows
# HERE IS THE TRICK: since X is not square, some columns are not assigned to any row!
not_assigned = [j for j in range(X.shape[1]) if j not in jj]
not_assigned_columns = X[:, not_assigned]
X_desired = np.concatenate((X_, not_assigned_columns), axis=1)
print(X_desired)
array([[0.63, 0.5 , 0.29, 0.46, 0.26, 0.64, 0.16],
[0.54, 0.74, 0.25, 0.28, 0.49, 0.12, 0.61],
[0.49, 0.28, 0.89, 0.76, 0.22, 0.44, 0.25],
[0.56, 0.53, 0.05, 0.6 , 0.56, 0.13, 0.45],
[0.31, 0.47, 0.35, 0.21, 0.66, 0.24, 0.61],
[0.29, 0.26, 0.24, 0.14, 0.4 , 0.85, 0.45]])
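For reference, a shorter route under the assumption that SciPy >= 1.4 is available: scipy.optimize.linear_sum_assignment accepts rectangular cost matrices directly and has a maximize flag, so neither padding nor a sign flip is needed. A minimal sketch; it should recover the same assignment, with the unassigned columns appended as above:
from scipy.optimize import linear_sum_assignment

row_ind, col_ind = linear_sum_assignment(X, maximize=True)
# row_ind is already sorted, so only the columns need re-ordering;
# columns that received no assignment (X has more columns than rows) go last
leftover = [j for j in range(X.shape[1]) if j not in col_ind]
X_desired = X[:, list(col_ind) + leftover]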
I have a mask with a shape of [64, 2895] and an array pred with a shape of [64, 2895, 161].
mask is binary, containing only 0s and 1s. What I want is to reduce pred so that it keeps the 64 batches and, along the 2895 dimension, returns only the pred entries where the mask for that batch is 1.
So as a simplified example, if:
mask = [[1, 0, 0],
[1, 1, 0],
[0, 0, 1]]
pred = [[[0.12, 0.23, 0.45, 0.56, 0.57],
[0.91, 0.98, 0.97, 0.96, 0.95],
[0.24, 0.46, 0.68, 0.80, 0.15]],
[[1.12, 1.23, 1.45, 1.56, 1.57],
[1.91, 1.98, 1.97, 1.96, 1.95],
[1.24, 1.46, 1.68, 1.80, 1.15]],
[[2.12, 2.23, 2.45, 2.56, 2.57],
[2.91, 2.98, 2.97, 2.96, 2.95],
[2.24, 2.46, 2.68, 2.80, 2.15]]]
What I want is:
[[[0.12, 0.23, 0.45, 0.56, 0.57]],
[[1.12, 1.23, 1.45, 1.56, 1.57],
[1.91, 1.98, 1.97, 1.96, 1.95]],
[[2.24, 2.46, 2.68, 2.80, 2.15]]]
I realize the resulting rows would have different lengths; I hope that's possible. If not, then fill the missing entries with 0. Either numpy or pytorch would be helpful. Thank you.
If you want a vectorized computation, ragged dimensions are not possible, but this would give you the result with the masked entries filled with 0:
# pred: torch.Size([64, 2895, 161])
# mask: torch.Size([64, 2895])
# extend mask with another dimension so it broadcasts for entry-wise multiplication
result = pred * mask[:, :, None]
and result is exactly what you want.
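If the ragged version is acceptable as a plain Python list of tensors rather than a single tensor, a minimal sketch (PyTorch, with random toy data standing in for your pred and mask):
import torch

pred = torch.randn(64, 2895, 161)
mask = torch.randint(0, 2, (64, 2895))

# boolean indexing per batch; ragged[i] has shape (mask[i].sum(), 161)
ragged = [p[m.bool()] for p, m in zip(pred, mask)]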
I have a simple plot of a 2D Gaussian distribution.
import numpy as np
from scipy.stats import multivariate_normal
from matplotlib import pyplot as plt
means = [ 1.03872615e+00, -2.66927843e-05]
cov_matrix = [[3.88809050e-03, 3.90737359e-06], [3.90737359e-06, 4.28819569e-09]]
# This works
a_lims = [0.7, 1.3]
b_lims = [-5, 5]
# This does not work
a_lims = [0.700006488869478, 1.2849292618191401]
b_lims =[-5.000288311285968, 5.000099437047633]
dist = multivariate_normal(mean=means, cov=cov_matrix)
a_plot, b_plot = np.mgrid[a_lims[0]:a_lims[1]:1e-2, b_lims[0]:b_lims[1]:0.1]
pos = np.empty(a_plot.shape + (2,))
pos[:, :, 0] = a_plot
pos[:, :, 1] = b_plot
z = dist.pdf(pos)
plt.figure()
plt.contourf(a_plot, b_plot, z, cmap='coolwarm', levels=100)
If I use the limits marked as "this works", I get the following plot (correct).
However, if I use the same limits, just slightly adjusted, the plot is completely wrong, with the distribution localized at different values (below).
I guess it is a bug in mgrid. Does anyone have any ideas? More specifically, why does the maximum of the distribution move?
Focusing just on the x axis:
In [443]: a_lims = [0.7, 1.3]
In [444]: np.mgrid[a_lims[0]:a_lims[1]:1e-2]
Out[444]:
array([0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ,
0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 , 0.91,
0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1. , 1.01, 1.02,
1.03, 1.04, 1.05, 1.06, 1.07, 1.08, 1.09, 1.1 , 1.11, 1.12, 1.13,
1.14, 1.15, 1.16, 1.17, 1.18, 1.19, 1.2 , 1.21, 1.22, 1.23, 1.24,
1.25, 1.26, 1.27, 1.28, 1.29, 1.3 ])
In [445]: a_lims = [0.700006488869478, 1.2849292618191401]
In [446]: np.mgrid[a_lims[0]:a_lims[1]:1e-2]
Out[446]:
array([0.70000649, 0.71000649, 0.72000649, 0.73000649, 0.74000649,
0.75000649, 0.76000649, 0.77000649, 0.78000649, 0.79000649,
0.80000649, 0.81000649, 0.82000649, 0.83000649, 0.84000649,
0.85000649, 0.86000649, 0.87000649, 0.88000649, 0.89000649,
0.90000649, 0.91000649, 0.92000649, 0.93000649, 0.94000649,
0.95000649, 0.96000649, 0.97000649, 0.98000649, 0.99000649,
1.00000649, 1.01000649, 1.02000649, 1.03000649, 1.04000649,
1.05000649, 1.06000649, 1.07000649, 1.08000649, 1.09000649,
1.10000649, 1.11000649, 1.12000649, 1.13000649, 1.14000649,
1.15000649, 1.16000649, 1.17000649, 1.18000649, 1.19000649,
1.20000649, 1.21000649, 1.22000649, 1.23000649, 1.24000649,
1.25000649, 1.26000649, 1.27000649, 1.28000649])
In [447]: _444.shape
Out[447]: (61,)
In [449]: _446.shape
Out[449]: (59,)
mgrid, when given ranges like a:b:c, uses np.arange(a, b, c), and arange with a float step is not reliable with regard to the end point.
mgrid also accepts an imaginary step such as 61j, which is treated as a number of points and evaluated with np.linspace, which is better for floating point steps. For example with the first set of limits:
In [453]: a_lims = [0.7, 1.3]
In [454]: np.mgrid[a_lims[0]:a_lims[1]:61j]
Out[454]:
array([0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ,
0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 , 0.91,
0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1. , 1.01, 1.02,
1.03, 1.04, 1.05, 1.06, 1.07, 1.08, 1.09, 1.1 , 1.11, 1.12, 1.13,
1.14, 1.15, 1.16, 1.17, 1.18, 1.19, 1.2 , 1.21, 1.22, 1.23, 1.24,
1.25, 1.26, 1.27, 1.28, 1.29, 1.3 ])
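Applying the same complex-step form to the problematic limits (the point counts here are an arbitrary choice of mine):
import numpy as np

a_lims = [0.700006488869478, 1.2849292618191401]
b_lims = [-5.000288311285968, 5.000099437047633]

# 61j / 101j mean "this many points"; mgrid then uses linspace semantics,
# so both endpoints are hit exactly regardless of floating point rounding
a_plot, b_plot = np.mgrid[a_lims[0]:a_lims[1]:61j, b_lims[0]:b_lims[1]:101j]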
===
By narrowing the b_lims considerably, and generating a finer mesh, I get a nice tilted ellipse.
means = [ 1, 0]
a_lims = [0.7, 1.3]
b_lims = [-.0002,.0002]
dist = multivariate_normal(mean=means, cov=cov_matrix)
a_plot, b_plot = np.mgrid[ a_lims[0]:a_lims[1]:1001j, b_lims[0]:b_lims[1]:1001j]
So I think the difference in your plots is an artifact of an excessively coarse mesh in the vertical direction. That potentially affects both the pdf generation and the contouring.
[High-resolution plot with the original grid points: only one b level intersects the high-probability values.] Since the ellipse is tilted, the two grids sample different parts of it, hence the seemingly different pdfs.
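Putting the pieces together, a sketch of the full plotting code with complex steps and the narrowed b range (the 1001-point mesh is an arbitrary choice):
import numpy as np
from scipy.stats import multivariate_normal
from matplotlib import pyplot as plt

means = [1.03872615e+00, -2.66927843e-05]
cov_matrix = [[3.88809050e-03, 3.90737359e-06],
              [3.90737359e-06, 4.28819569e-09]]

dist = multivariate_normal(mean=means, cov=cov_matrix)
a_plot, b_plot = np.mgrid[0.7:1.3:1001j, -0.0002:0.0002:1001j]
pos = np.dstack((a_plot, b_plot))  # shape (1001, 1001, 2)
z = dist.pdf(pos)

plt.contourf(a_plot, b_plot, z, cmap='coolwarm', levels=100)
plt.show()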
I have an array like this and would like to get back, for each row, the column numbers where the value is over the threshold of 0.6:
X = array([[ 0.16, 0.40, 0.61, 0.48, 0.20],
[ 0.42, 0.79, 0.64, 0.54, 0.52],
[ 0.64, 0.64, 0.24, 0.63, 0.43],
[ 0.33, 0.54, 0.61, 0.43, 0.29],
[ 0.25, 0.56, 0.42, 0.69, 0.62]])
Result would be:
[[2],
[1, 2],
[0, 1, 3],
[2],
[3, 4]]
Is there a better way of doing this than with a double for-loop?
def get_column_over_threshold(data, threshold):
    column_numbers = [[] for _ in range(len(data))]
    for row, sample in enumerate(data):
        for col, value in enumerate(sample):
            if value >= threshold:
                column_numbers[row].append(col)
    return column_numbers
For each row you can ask for the indices where the elements are greater than 0.6:
result = [where(row > 0.6) for row in X]
This performs the computation you want, but the format of result is somewhat inconvenient, since the result of where in this case is a tuple of size 1, containing a NumPy array with the indices. We can replace where with flatnonzero to get the array directly rather than the tuple. To obtain a list of lists, we explicitly cast this array to a list:
result = [list(flatnonzero(row > 0.6)) for row in X]
(In the code above I assume you have used from numpy import *)
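For completeness, the same one-liner with the conventional namespaced import (nothing else changes):
import numpy as np

result = [list(np.flatnonzero(row > 0.6)) for row in X]
# -> [[2], [1, 2], [0, 1, 3], [2], [3, 4]]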
Use np.where to get the row and column indices, then use np.split on the column indices to get the per-row lists as arrays -
In [18]: r,c = np.where(X>0.6)
In [19]: np.split(c,np.flatnonzero(r[:-1] != r[1:])+1)
Out[19]: [array([2]), array([1, 2]), array([0, 1, 3]), array([2]), array([3, 4])]
To make it more generic so that it handles rows without any match, we could loop through the column indices obtained from np.where and assign into an initialized array, like so -
def col_indices_per_row(X, thresh):
    mask = X > thresh
    r, c = np.where(mask)
    out = np.empty(len(X), dtype=object)
    grp_idx = np.r_[0, np.flatnonzero(r[:-1] != r[1:]) + 1, len(r)]
    valid_rows = r[np.r_[True, r[:-1] != r[1:]]]
    for (row, i, j) in zip(valid_rows, grp_idx[:-1], grp_idx[1:]):
        out[row] = c[i:j]
    return out
Sample run -
In [92]: X
Out[92]:
array([[0.16, 0.4 , 0.61, 0.48, 0.2 ],
[0.42, 0.79, 0.64, 0.54, 0.52],
[0.1 , 0.1 , 0.1 , 0.1 , 0.1 ],
[0.33, 0.54, 0.61, 0.43, 0.29],
[0.25, 0.56, 0.42, 0.69, 0.62]])
In [93]: col_indices_per_row(X, thresh=0.6)
Out[93]:
array([array([2]), array([1, 2]), None, array([2]), array([3, 4])],
dtype=object)
The following function tries to normalize an array of 3D vectors:
import numpy

def my_norm(v):
    """
    #type v: Nx3 numpy array
    """
    return v / numpy.linalg.norm(v, axis=1)[:, None]
It works when v is 2D, but for a single 1D vector I got ValueError: 'axis' entry is out of bounds. I can do the following check to deal with both cases, but I wonder if there is a cleaner way?
def my_norm(v):
    """
    #type v: Nx3 numpy array or a single 3-element vector
    """
    if v.ndim == 1:
        return v / numpy.linalg.norm(v)
    return v / numpy.linalg.norm(v, axis=1)[:, None]
Use axis=-1 and keep the dimensions with keepdims=True -
v / np.linalg.norm(v, axis=-1, keepdims=True)
Sample runs
1D Case :
In [61]: v = np.random.rand(6)
In [62]: v/np.linalg.norm(v)
Out[62]: array([ 0.22, 0.1 , 0.28, 0.58, 0.64, 0.33])
In [63]: v/np.linalg.norm(v, axis=-1,keepdims=True)
Out[63]: array([ 0.22, 0.1 , 0.28, 0.58, 0.64, 0.33])
2D Case :
In [58]: v = np.random.rand(4,6)
In [59]: v / np.linalg.norm(v, axis=1)[:, None]
Out[59]:
array([[ 0.53, 0.04, 0.38, 0.21, 0.58, 0.43],
[ 0.49, 0.4 , 0.02, 0.56, 0.38, 0.38],
[ 0.05, 0.49, 0.45, 0.18, 0.54, 0.47],
[ 0.45, 0.61, 0.19, 0.1 , 0.14, 0.61]])
In [60]: v/np.linalg.norm(v, axis=-1,keepdims=True)
Out[60]:
array([[ 0.53, 0.04, 0.38, 0.21, 0.58, 0.43],
[ 0.49, 0.4 , 0.02, 0.56, 0.38, 0.38],
[ 0.05, 0.49, 0.45, 0.18, 0.54, 0.47],
[ 0.45, 0.61, 0.19, 0.1 , 0.14, 0.61]])
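A quick check of the unified function on both input shapes (toy numbers of my own, not from the runs above):
import numpy as np

def my_norm(v):
    # works for a single vector (shape (3,)) and for a batch (shape (N, 3))
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

print(my_norm(np.array([3.0, 4.0, 0.0])))    # [0.6 0.8 0. ]
print(my_norm(np.array([[3.0, 4.0, 0.0],
                        [0.0, 0.0, 2.0]])))  # each row normalized independently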