I'm trying to implement Integer Programming for Nearest Neighbor Classifier in python using cvxpy.
Short intro
Given a dataset of n points with a color (red or blue) we would like to choose the minimal number of candidate points, s.t for each point that isn`t a candidate, its closest candidate has the same color.
My flow
Given a set of n points (with colors) define an indicator vector I (|I| = n),
I_i = 1 if and only if vertex i is chosen as a candidate
In addition, I defined two more vectors, named as A and B (|A| = |B| = n) as follow:
A_i = the distance between v_i to it's closest candidate with the **same** color
B_i = the distance between v_i to it's closest candidate with a **different** color
Therefore, I have n constrains which are:
B_i > A_i
for any i
My target is to minimize the sum of vector I (which represents the number of candidates)
My Issue
Its seems that the vectors A, B are changing because they affected by I, since when a candidate is chosen, it is affecting its entry in I which affects A and B and the constrains are dependent on those vectors..
Any suggestions?
Thanks !
To recap: you want to find the smallest set of examples belonging to a given training set such that the resulting nearest neighbor classifier achieves perfect accuracy on that training set.
I would suggest that you formulate this as follows. Create a 0–1 variable x(e) for each example e indicating whether e is chosen. For each ordered pair of examples e and e′ with different labels, write a constraint
x(e′) ≤ ∑e′′∈C(e,e′) x(e′′)
where C(e, e′) is the set of examples e′′ with the same label as e such that e′′ is closer to e than e′ is to e (including e′′ = e). This means that, if e′ is chosen, then it is not the nearest chosen example to e.
We also need
∑e x(e) ≥ 1
to disallow the empty set. Finally, the objective is
minimize ∑e x(e).
Related
First I generated an NxN matrix of zeros and ones using NumPy. After that, I generated a copy matrix from the previous matrix, I replaced the ones in the first matrix with the weight of the edges. ( The matrix is symmetric and connected and undirected and its diagonal is zero like the original matrix) and I used BSF to check if it's connected and I found it connected every time. Then I used SciPy to find the MST (Minimum Spanning Tree). After that, I illustrated the MST using Network X
for generating NxN Matrix of zeros and ones
base = np.zeros((shape,shape))
for _ in range(100):
a = np.random.randint(shape)
b = np.random.randint(shape)
if a != b:
base[a, b] = 1
base[b, a] = 1
for generating NxN Matrix with the weight of edges
# Fetch the location of the 1s.
Weightofedges = base
ones = np.argwhere(Weightofedges == 1)
ones = ones[ones[:, 0] < ones[:, 1], :]
# Assign random values.
for a, b in ones:
Weightofedges[a, b] = Weightofedges[b, a] = np.random.randint(100)
Find the MST using SciPy
from scipy.sparse.csgraph import minimum_spanning_tree
X = minimum_spanning_tree(Weightofedges)
print("The Output Of The MST By Kruskal Algorithm:")
print(" Edges: Weights:")
print(X)
print("-----------------------")
my_matrix3 = X.toarray().astype(int)
The Problem: When I input a matrix with a large number of nodes I got some nodes not connected with an edge
e.g.
Number Of Nodes equals 75
Number Of Edges equals 65
In the MST the edges must be N-1 where N is the number of nodes
This is the graph using N = 75 ( as shown there are nodes without edges )
enter image description here
You have created a weighted version of the Erdős–Rényi model - to be exact the ER model G(n,M) with n nodes and M edges. Currently, you have fixed M=100 and you observe for n>60 that your becomes disconnected. This is typical and (at least for the second ER model variant G(n,p) with n nodes and probability of an edge p) you can even calculate the threshold where you (almost surely) get a single/large connected component. But even without the math, you can intuitively see that it becomes difficult to connect 75 nodes with only 100 random edges.
I recommend that you check out the networkx package, if you want to do more with graphs on python. For example, the implementation of the G(n,p) variant: erdos_renyi_graph.
I've implemented a k-means algorithm and performance is highly dependent on how centroids were initialized. I'm finding random uniform initialization to give a good k-means about 5% of the time, whereas with k-means++, it's closer to 50%. Why is the yield for good k-means so low? I should disclaim I've only used a handful of data sets and my good/bad rates are indicative of only those, not broadly.
Here's an example using k-means++ where the end result was not great. The Dunn Index of this clustering is 0.16.
And an example where it worked perfectly with a Dunn Index of 0.67.
I was maybe under the naive impression k-means++ produced a good k-means every time. Is there perhaps something wrong with my code?
def initialize_centroids(points, k):
"""
Parameters:
points : a list of Points.
k : how many centroids to place.
Returns:
A list of centroids.
"""
clusters = []
clusters.append(choice(points)) # first centroid is random point
for _ in range(k - 1): # for other centroids
distances = []
for p in points:
d = inf
for c in clusters: # find the minimal distance between p and c
d = min(d, distance(p, c))
distances.append(d)
# find maximum distance index from minimal distances
clusters.append(points[distances.index(max(distances))])
return clusters
This is adapted from the algorithm as found on Wikipedia:
Choose one center uniformly at random from among the data points.
For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
Repeat Steps 2 and 3 until k centers have been chosen.
Now that the initial centers have been chosen, proceed using standard k-means clustering.
The difference is the centroids are chosen such that it is the furthest distance, not a probability to choose between furthest distances.
My intention is to compare the Dunn Index over different values of k, and empirically the Dunn Index being higher means better clustering. I can't collect (good) data if half of the time it doesn't work, so my results are skewed due to the faultiness of k-means++ or my implementation thereof.
What other initialization strategies can be employed to get a more consistent result?
This question is about creating a K-nearest neighbor graph [KNNG] from a dataset with an unknown number of centroids (which is not the same as K-means clustering).
Suppose that you have a dataset of observations stored in a data matrix X[n_samples, n_features] with each row being an observation or feature vector and each column being a feature. Now suppose you want to compute the (weighted) k-Neighbors graph for points in X using sklearn.neighbors.kneighbors_graph.
What are the basic methods to pick the number of neighbors to use for each sample? What algorithms scale well when you have lots of observations?
I have seen this brute force method below but it doesn't do well when the sample dataset size becomes large and you have to pick a good starting upper bound for n_neighbors_max. Does this algorithm have a name?
def autoselect_K(X, n_neighbors_max, threshold):
# get the pairwise euclidean distance between every observation
D = sklearn.metrics.pairwise.euclidean_distances(X, X)
chosen_k = n_neighbors_max
for k in range(2, n_neighbors_max):
k_avg = []
# loop over each row in the distance matrix
for row in D:
# sort the row from smallest distance to largest distance
sorted_row = numpy.sort(row)
# calculate the mean of the smallest k+1 distances
k_avg.append(numpy.mean(sorted_row[0:k]))
# find the median of the averages
kmedian_dist = numpy.median(k_avg)
if kmedian_dist >= threshold:
chosen_k = k
break
# return the number of nearest neighbors to use
return chosen_k
From your code, it appears that you are looking for a classification result based on the nearest neighbour.
In such a case your search over the distance matrix is akin to a brute force search and defeats the purpose of Nearest neighbour algorithms.
Perhaps what you are looking for is the NNClassifier. Here https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Regarding the choice of the number of nearest neighbours, this depends on the sparsity of your data. It helps to view Nearest Neighbour as a way to bound your search. Rather than look over all samples. It will allow you
to narrow the search to the top-N (nearest neighbour) samples. Afterward
you can apply a domain specific technique on these N samples to get the desired result.
I am trying to calculate with Pyspark the mutual information of a continuous variable and a categorical one, without having to bin the continuous variable (cf. https://journals.plos.org/plosone/articleid=10.1371/journal.pone.0087357).
The formula from the article above requires that I calculate, on each point of my categorical variable, the K nearest neighbours in terms of a distance defined on the continuous variable, only accounting the data points from the same categorical class as the current point. A good value of K is usually 3.
For example on this dataset: ( please note there is no redundancy, I only showed 2 features here).
Dataset sorted by metric
If k = 3 and I am on the point A = (0.023, Orange), I find the 3 nearest neighbours in terms of metric with categories = Orange, so it will be:
(0, Orange), (0, Orange), (0.11, Orange).
Once I have found these neighbours, I need to find the distance of the furthest one and define it as my diameter around A (here 0.11 - 0.023), and then find the number of neighbours in the whole dataset, which I will call m.
Once we have the diameter and m for each point, we use them to calculate a number Ni for each point, and we do the mean over the whole dataset.
I am having trouble to achieve the code that loops over the points in my pyspark dataset ( .rdd.map) and does the whole operation of finding the diameter and the number of neighbours m. I tried using window functions but it was hard to define the range since the range is not constant.
Thank you.
I have a dataset of about 3 million samples (each with just 3 features). I'm using scikit's sklearn.neighbors module - specifically radius_neighbor_graph - to find which samples fall within a small radius of a specific sample.
This works fine, but unsurprisingly it's really, really slow to compute this graph.
It's also very wasteful, because I only ever need to know the neighbors for a small subset of my samples (~ 100,000 of them) - and I know this subset in advance.
So... is there any way of being more efficient by calculating the neighbours within a given radius for just this subset of samples? It seems like it should be simple, but I can't think of an easy way of doing it.
First of all, the task of creating a radius-neighborhood-graph involves reading the N by N distance-matrix associated to your dataset. Since distance matrices have nice properties you can save some time, but still complexity lies somewhere in O(N^2). Here N is the number of data points in your data set X.
So one could say, that only a small number of n < N points are of interest as the center of a neighborhood, but the majority of points are just interesting as neighbors. This would result in an n by N distance matrix, where row i contains the distances of data point i to each other data point j, 1 <= i <= n, 1 <= j <= N. But this "distance matrix" has none of the desirable properties of a normal distance matrix (it is not even a square matrix), that you could use to speed up the process of creating an epsilon-neighborhood-graph.
Therefore I don't think that you find a predefined function for your case. If you want to build one your own, the steps should be as follows: Let X be your data set and i be the data point of interest.
Create the distance matrix D associated to your data set, use scipy.spatial.distance_matrix and take as x the small subset of your data set and as y the whole data set.
Create a list, neighbors = []
Loop over the i'th row of the distance matrix. If D(i,j) < epsilon, then save j in neighbors. It is the index of a data point in the epsilon neighborhood of i.
Return neighbors
Of course the computation of the distance matrix should happen once at the beginning (maybe in init() if you wrap everything up in a class), and the function/method that returns all epsilon neighbors of a data point should only depend on the index of the data point in question.
Hope this helps!