How to compute an injective/surjective map with minimum weight? - python

We are given an m x n matrix w which represents the edge weights in a complete bipartite graph K_m,n. We wish to find a map {1,...,m} -> {1,...,n} with minimal weight, which is injective or surjective. Choosing a map is equivalent to, for every vertex v in {1,...,m}, choosing exactly one edge incident to v.
Let m <= n. An injective function with minimal weight can then be found as a minimum-weight matching that covers every vertex of {1,...,m}. In Python, this is implemented in scipy:
import numpy as np
import scipy, scipy.optimize
w=np.random.rand(5,10)
print(scipy.optimize.linear_sum_assignment(w))
Let m>=n. How can a surjective function with minimal weight be found? I'm looking for a concrete implementation in Python.

EDIT: It turns out I might have misunderstood your question.
From injective to surjective
If m > n, and you already have an algorithm that handles the case m <= n, then swap the two components. Your algorithm will give you an injective function from the second component to the first. Take the inverse of that function; it will be a surjective function from a subset of the first component to the second component.
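A minimal sketch of this idea with scipy's linear_sum_assignment, run on the transposed weight matrix (same conventions as the snippet above); the greedy completion for the leftover vertices of the first component is my own assumption and is not guaranteed to give the overall minimum weight:
import numpy as np
import scipy.optimize
w = np.random.rand(8, 5)  # m > n
m, n = w.shape
# minimum-weight injective map from the second component into the first:
rows, cols = scipy.optimize.linear_sum_assignment(w.T)  # rows index {1,...,n}, cols index {1,...,m}
# invert it: a map from a subset of {1,...,m} onto {1,...,n}
f = {int(i): int(j) for j, i in zip(rows, cols)}
# simple (hypothetical, not necessarily optimal) completion: every remaining
# vertex of the first component takes its cheapest edge, so the map stays surjective
for i in range(m):
    f.setdefault(i, int(np.argmin(w[i])))
print(f)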
Using networkx
What you are looking for is a maximum-cardinality matching in a bipartite graph.
The python library networkx contains several functions for that. Standard problems are "maximum-cardinality matching" and "maximum-weight matching"; I'd be surprised if there were a function to solve your "minimum-weight maximum-cardinality matching" problem directly.
However, it looks to me as if your problem is equivalent to finding a maximum-weight matching in the weighted graph obtained by replacing every weight w by W-w, where W is some very large value (for instance, three times the maximum weight in the original graph).
By including this large value W in the weight of every edge, you're forcing the maximum-weight matching to be a maximum-cardinality matching. And by including the negative value -w, you're asking the algorithm to find edges with the smallest possible original weight in the original graph.
Documentation: networkx.algorithms.matching.max_weight_matching
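A hedged sketch of that transformation with networkx (the node labels and the choice W = 3 * w.max() are assumptions on my part, following the suggestion above):
import numpy as np
import networkx as nx
w = np.random.rand(5, 10)
m, n = w.shape
W = 3 * w.max()  # large constant, as suggested above
G = nx.Graph()
for i in range(m):
    for j in range(n):
        # maximizing W - w forces maximum cardinality while preferring small original weights
        G.add_edge(("u", i), ("v", j), weight=W - w[i, j])
matching = nx.max_weight_matching(G)
# normalize each matched edge to a (first component, second component) pair
pairs = sorted((a[1], b[1]) if a[0] == "u" else (b[1], a[1]) for a, b in matching)
print(pairs)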

Related

multivariate normal pdf with nan in mean

Is there an efficient implementation in Python to evaluate the PDF of a multivariate normal distribution when there are missing values in x? I guess the idea would be to effectively reduce the dimensionality to the number of available entries in the particular vector whose probability you are trying to evaluate. But I can't figure out whether the scipy implementation has a way to ignore masked values.
e.g.,
from scipy.stats import multivariate_normal as mvnorm
import numpy as np
means = [0.0,0.0,0.0]
cov = np.array([[1.0,0.2,0.2],[0.2,1.0,0.2],[0.2,0.2,1.0]])
d = mvnorm(means,cov)
x = [0.5,-0.2,np.nan]
d.pdf(x)
yields output:
nan
(as expected)
Is there a way to efficiently evaluate the PDF for only values that are present (in this case, making effectively 3D case into a bivariate case?) using this implementation?
This question is a bit tricky in terms of both the math and the code. Let me elaborate.
First, the code. scipy.stats does not offer nan-handling as you desire. Speedy code likely requires implementing the multivariate normal distribution PDF by hand and applying it to NumPy arrays directly. Leveraging vectorization is the only way to efficiently offer this functionality for large-scale datasets. On the other hand, the nan-tolerant function nanTol_pdf() below provides the desired functionality while staying true to the multivariate normal distribution as implemented in SciPy. You might find it sufficient for your use case.
import numpy as np
from scipy import stats
from scipy.stats import multivariate_normal as mvnorm

def nanTol_pdf(d, x):
    '''
    Returns the value of the multivariate normal density of d restricted to
    the non-NaN indices of the input vector x.
    '''
    assert isinstance(d, stats._multivariate.multivariate_normal_frozen) and isinstance(x, (list, np.ndarray))
    x = np.asarray(x, dtype=float)
    # check for nan entries
    if np.any(np.isnan(x)):
        # indices of the observed (non-NaN) components
        subIndex = np.argwhere(~np.isnan(x)).reshape(-1)
        # lower-dimensional multivariate Gaussian over those components
        lowDim_mean = d.mean[subIndex]
        lowDim_cov = d.cov[np.ix_(subIndex, subIndex)]  # use the frozen distribution's covariance, not a global
        lowDim_d = mvnorm(lowDim_mean, lowDim_cov)
        return lowDim_d.pdf(x[subIndex])
    else:
        return d.pdf(x)
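A quick usage check with the question's example (a sketch; the printed number is simply whatever the bivariate marginal density gives at (0.5, -0.2), I have not hard-coded it):
cov = np.array([[1.0, 0.2, 0.2], [0.2, 1.0, 0.2], [0.2, 0.2, 1.0]])
d = mvnorm([0.0, 0.0, 0.0], cov)
print(nanTol_pdf(d, [0.5, -0.2, np.nan]))  # density of the 2D marginal over the first two components
print(nanTol_pdf(d, [0.5, -0.2, 0.1]))     # no NaN: falls back to d.pdf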
Regardless, the fact that we can do it shouldn't stop us from asking whether we should.
Second, the math. Mathematically speaking, it is unclear what you are trying to achieve. In your example, SciPy returns nan because you query it with an ill-defined input vector x; an undefined output, i.e. not a number (nan), seems the most appropriate answer. Jointly truncating the distribution d and the input vector x avoids the numerical problem but raises statistical questions, in particular because probability density values cannot be interpreted as (conditional) probabilities. Moreover, the output alone does not reveal whether truncation was applied: remember that nanTol_pdf() will happily return a non-negative real number as long as at least one entry of the vector is a real number. Your use case will decide whether this is reasonable.
Finally, I would suggest at least considering missing data imputation techniques before moving forward. Let me know if this helps.

Calculating the adjacency matrix and finding nearest neighbors for different node type/classes

Before I describe my problem I'll summarise what I think I'm looking for: a method for nearest-neighbor searches that are restricted by node type in Python (in my case a node represents an atom and the node type represents which element the atom is), so that only the nearest neighbors of a given type are returned. Maybe I am wording my problem incorrectly; I haven't been able to find any existing methods for this.
I am writing some ring statistics code to find different types of rings for molecular dynamics simulation data. The input data structure is a big array of atom id, atom type, and XYZ positions.
For example.
At the moment I only consider single-element systems (for example graphene, so only Carbon atoms are present). So each node is considered the same type when finding its nearest neighbors and calculating the adjacency matrix.
For this, I am using KDTree and other scipy.spatial algorithms to find all atoms within the bond length, r, of any given atom. If an atom is within a radius r of a given atom, I consider it connected and then populate and update an adjacency dictionary accordingly.
def create_adjacency_dict(data, r, leaf_size=5, box_size=None):
    from scipy.spatial import KDTree
    tree = KDTree(data, leafsize=leaf_size, boxsize=box_size)
    # all atoms within radius r of each point
    all_nn_indices = tree.query_ball_point(data, r, workers=5)
    adj_dict = {}
    for count, item in enumerate(all_nn_indices):
        adj_dict[count] = item  # populate adjacency dictionary
    for node, nodes in adj_dict.items():
        if node in nodes:
            nodes.remove(node)  # remove each node from its own neighbour list (no self-loops)
    adj_dict = {k: set(v) for k, v in adj_dict.items()}
    return adj_dict
I would like to expand the code to deal with multi-species systems. For example AB2, AB2C4 etc (Where A,B and C represent different atomic species). However, I am struggling to figure out a nice way to do this.
A
/ \
B B
The obvious method would be to just do a brute force Euclidean approach. My idea is to input the bond types for a molecule, so for AB2 (shown above), you would input something like AB to indicate the different types of bonds to consider, and then the respective bond lengths. Then loop over each atom finding the distance to all other atoms and, for this example of AB2, if an atom of type A is within the bond length of an atom B, consider them connected and populate the adjacency matrix. However, I'd like to be able to use the code on large datasets of 50,000+ atoms, so this method seems wasteful.
I suppose I could still use my current method, but just search for say the 10 nearest neighbors of a given atom and then do a Euclidean search for each atom pair, following the same approach as above. Still seems like a better method would already exist though.
Do better methods already exist for this type of problem, i.e. finding nearest neighbors restricted by node type? Or maybe someone knows a more accurate wording of my problem, which I think is part of my issue here.
"Then search the data."
This sounds like that old cartoon where someone points to a humorously complex diagram with a tiny label in the middle that says "Here a miracle happens".
Seriously, I am guessing that this searching is what you need to optimize (you do not exactly say).
In turn, this suggests that you are doing a linear search through every atom and calculating the distance to each. Could it be so!?
There is a standard answer for this problem, called an octree.
https://en.wikipedia.org/wiki/Octree
The Netflix TV miniseries 'The Billion Dollar Code' dramatizes the advantages of this approach: https://www.netflix.com/title/81074012
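SciPy has no octree, but its KDTree provides the same kind of hierarchical spatial index, and the type restriction from the question can be layered on top by building one tree per element. A rough sketch of that idea (the array layout and the bond-length table are my own assumptions, not part of the original code):
import numpy as np
from scipy.spatial import KDTree

def typed_neighbours(positions, types, bond_lengths):
    # positions: (N, 3) array of coordinates; types: length-N array of element labels;
    # bond_lengths: dict mapping a (typeA, typeB) pair to its cutoff radius
    unique = np.unique(types)
    trees = {t: KDTree(positions[types == t]) for t in unique}
    index_of = {t: np.flatnonzero(types == t) for t in unique}
    adj = {i: set() for i in range(len(positions))}
    for (ta, tb), r in bond_lengths.items():
        # neighbours of every type-ta atom among the type-tb atoms, within radius r
        hits = trees[tb].query_ball_point(positions[types == ta], r)
        for local_a, local_hits in enumerate(hits):
            a = index_of[ta][local_a]
            for local_b in local_hits:
                b = index_of[tb][local_b]
                if a != b:
                    adj[a].add(b)
                    adj[b].add(a)
    return adj

# hypothetical usage for an AB2-type system with a 1.6 length-unit A-B bond:
# adj = typed_neighbours(xyz, np.array(species), {("A", "B"): 1.6})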

Is there a chain constraint work around for CVXPY?

I keep encountering the same issue while trying to solve an integer programming problem with cvxpy, particularly with the constraints.
Some background on my problem and use case. I am trying to write a program that optimizes cut locations for 3D objects. The goal is to have as few interfaces as possible, but there is a constraint that each section can only have a certain maximum length. To visualize, you could picture a tree. If you cut at the bottom, you only have to make one large cut, but if the tree is longer than the maximum allowed length (if you needed to move it on a trailer of a certain length, for example) you would need to make one or several more cuts along the tree. As you go further up, it is likely that in addition to the main stem of the tree, you would need to cut some smaller side branches along the same horizontal plane. I have written a program that outputs the number of interfaces (or cuts) needed at many evenly spaced horizontal planes along the height of an object. Now I am trying to pass that data to a new piece of code that will perform an integer programming optimization to determine the best location(s) to cut the tree, treating each of the horizontal cutting planes as either active or inactive.
Below is my code:
#Create ideal configuration to solve for
config = cp.Variable(layer_split_number, boolean=True)
#Create objective
objective = sum(config*layer_islands_data[0])
problem = cp.Problem(cp.Minimize(objective),[layer_height_constraint(layer_split_number,layer_height,config) <= ChunkingParameters.max_reach_z])
#solve
problem.solve(solver = cp.GLPK_MI)
Layer Height Constraint function
def layer_height_constraint(layer_split_number, layer_height, config):
    # create array of the absolute height (relative to ground) of each layer
    layer_height_array = (np.array(range(1, layer_split_number + 1)) + 1) * layer_height
    # set inactive cuts to 0
    active_heights = layer_height_array * config
    # filter out all 0's
    active_heights_trim = active_heights[active_heights != 0]
    # insert top and bottom values
    active_heights_trim = np.append(active_heights_trim, [(layer_split_number + 1) * layer_height])
    active_heights_trim = np.insert(active_heights_trim, 0, 0)
    # take the difference between active cuts to find distance
    active_heights_diff = np.diff(active_heights_trim)
    # find the maximum of those differences
    max_height = max(active_heights_diff)
    return max_height
With this setup, I get the following error:
Cannot evaluate the truth value of a constraint or chain constraints, e.g., 1 >= x >= 0.
I know that the two problem spots are the use of Python's built-in max function in the last step and the step in the middle where I filter out the 0s in the array (because this introduces another equality of sorts). However, I can't really think of another way to solve this or to set up the constraints differently. Is it possible to have cvxpy just accept a value into the constraint? My function is set up to output a single maximum distance value for a given configuration, so to me it would make sense if I could just feed it the configuration being tried for the current iteration (an array of 0s and 1s representing inactive and active cuts respectively) and the function would return the result, which could then be compared to the maximum allowed distance. However, I'm pretty sure IP solvers are a bit more complex than just running a bunch of iterations, but I don't really know.
Any help on this or ideas would be greatly appreciated. I have tried an exhaustive search of the solution space, but when I have 10 or even 50+ cuts an exhaustive search is super inefficient. I would need to try 2^n combinations for n number of potential cuts.

Python Skcriteria Package, How do I get the similarity index for all cases and not just the top case with TOPSIS?

Looking through the documentation:
https://scikit-criteria.readthedocs.io/en/latest/tutorial/quickstart.html
rank.e_.similarity shows the similarity index for the top case, but it does not specify how to get that value for other cases. My project seeks to show the index for the top 10 cases of an experiment.
I tried simple indexing, but without knowing how the object is set up I cannot call the value for other cases. The values must exist in the backend somewhere, since they are needed to determine the TOPSIS best case.
The vector stored in rank.e_.similarity contains the similarity values for each of the alternatives.
Thus rank.e_.similarity[0] holds the similarity index for the alternative rank.alternatives[0].
This value is calculated with the formula:
similarity = d_worst / (d_better + d_worst)
Where:
d_worst is the distance to the anti-ideal alternative
d_better is the distance to the ideal alternative.
As of scikit-criteria 0.6 the distance metric is configurable. By default the Euclidean distance is used.
For more details regarding the ideal and anti-ideal points I recommend reading the TOPSIS wikipedia page https://en.wikipedia.org/wiki/TOPSIS
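For the top-10 use case, a small sketch of one way to rank the alternatives by that vector (this assumes rank is the result object from applying TOPSIS as in the quickstart; the variable names are mine):
import numpy as np
# similarity value for every alternative, in the same order as rank.alternatives
sims = np.asarray(rank.e_.similarity)
order = np.argsort(sims)[::-1]  # indices sorted by similarity, best first
for idx in order[:10]:          # top 10 cases
    print(rank.alternatives[idx], sims[idx])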

can unnormalized load centrality be non-integer?

I am trying to use the python NetworkX package to validate some other code, but I am worried that load centrality does not mean what I think. When I run the example below, I expect to only get integer values for load since it should just be a count at each node of the number of shortest paths passing through the node (i.e., I list out all the shortest paths between node pairs, then for each node 'v' count how many paths cross it, excluding paths where 'v' is the first or last node):
edges = [ ('a0','a1'),('a0','a2'),('a1','a4'),('a1','a2'),('a2','a4'),('a2','z5'),('a2','a3'),('a3','z5'),('a4','z5'),('a4','z6'),('a4','z7')
,('z5','z6'),('z5','z7'),('z5','z8'),('z6','z7'),('z6','z8'),('z6','z9'),('z7','z8'),('z7','z9'),('z8','z9')]
import networkx as nx
testg = nx.Graph( edges )
nx.load_centrality( testg, normalized=False )
I get output like this:
{'a0': 0.0,
'a1': 3.16666665,
'a2': 15.4999998,
'a3': 0.0,
'a4': 14.75,
'z5': 20.25,
'z6': 6.04166666,
'z7': 6.04166666,
'z8': 2.24999996,
'z9': 0.0}
These are similar to the values I compute by hand in terms of relative size, but why aren't they integer values? Every other network that I have tested returns integer values for unnormalized load centrality, and I don't see anything in the definition that would lead to these values. The python doc for this function says to see betweenness and also provides an article as reference for the algorithm (which I can't access).
After very extensive calculations based on the paper Downshift linked to, it looks like 'load' follows the betweenness definition in that paper but subtracts off a factor of (2n-1) to adjust for some overcounting in the algorithm. Either that, or the algorithm in the paper doesn't make clear that the initial packets of size '1' should only contribute to nodes they pass through and not to the ends of the paths. In any case, I can match the output values of networkx now. The values differ from networkx's own betweenness function which follows the formula in the documentation based on node pairs rather than propagating packets of size 1 through the network.
In particular, because the packets split into equal size at branch points, nodes can accumulate partial packets and therefore accumulate a non-integer 'load' value. That's not what the description implies in the networkx documentation, but it's clear enough now.
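For anyone who wants to reproduce the comparison, a small sketch (using the same edge list as above) that prints the unnormalized load and betweenness values side by side, so the non-integer loads and the difference between the two measures are visible directly:
import networkx as nx
edges = [('a0','a1'),('a0','a2'),('a1','a4'),('a1','a2'),('a2','a4'),('a2','z5'),('a2','a3'),('a3','z5'),('a4','z5'),('a4','z6'),('a4','z7')
        ,('z5','z6'),('z5','z7'),('z5','z8'),('z6','z7'),('z6','z8'),('z6','z9'),('z7','z8'),('z7','z9'),('z8','z9')]
G = nx.Graph(edges)
load = nx.load_centrality(G, normalized=False)        # packet-propagation definition
btw = nx.betweenness_centrality(G, normalized=False)  # pair-counting definition
for v in sorted(G):
    print(f"{v}: load={load[v]:.4f}  betweenness={btw[v]:.4f}")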
