find Markov steady state with left eigenvalues (using numpy or scipy) - python

I need to find the steady states of Markov models using the left eigenvectors of their transition matrices, using some Python code.
It has already been established in this question that scipy.linalg.eig fails to provide actual left eigenvectors as described, but a fix is demonstrated there. The official documentation is mostly useless and incomprehensible, as usual.
A bigger problem than the incorrect format is that the eigenvalues produced are not in any particular order (not sorted, and different each time). So if you want to find the left eigenvectors that correspond to the eigenvalue 1, you have to hunt for them, and this poses its own problem (see below). The math is clear, but how to get Python to compute this and return the correct eigenvectors is not. Other answers to this question, like this one, don't seem to be using the left eigenvectors, so those can't be correct solutions.
This question provides a partial solution, but it doesn't account for the unordered eigenvalues of larger transition matrices. So, just using
leftEigenvector = scipy.linalg.eig(A,left=True,right=False)[1][:,0]
leftEigenvector = leftEigenvector / sum(leftEigenvector)
is close, but doesn't work in general because the entry in the [:,0] position may not be the eigenvector for the correct eigenvalue (and in my case it is usually not).
Okay, but the output of scipy.linalg.eig(A,left=True,right=False) is a tuple whose [0] element is an array of the eigenvalues (in no particular order), followed in position [1] by an array of eigenvectors in corresponding order to those eigenvalues.
I don't know a good way to sort or search that whole thing by the eigenvalues to pull out the correct eigenvectors (all eigenvectors with eigenvalue 1, normalized by the sum of the vector entries). My thinking is to get the indices of the eigenvalues that equal 1, and then pull those columns from the array of eigenvectors. My version of this is slow and cumbersome. First I have a function (that doesn't quite work) to find the positions in a list that match a value:
# Find the positions of the element a in theList
def findPositions(theList, a):
    return [i for i, x in enumerate(theList) if x == a]
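One source of the trouble is the exact `x == a` comparison: eig returns eigenvalues that are only numerically equal to 1 (e.g. 1.0000000000000002). A tolerance-based variant avoids that (a sketch; the name findPositionsClose is mine, not from the original post):

```python
# Find the positions of elements numerically close to a in theList.
# abs() also handles the complex eigenvalues that scipy.linalg.eig returns.
def findPositionsClose(theList, a, tol=1e-10):
    return [i for i, x in enumerate(theList) if abs(x - a) < tol]
```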
Then I use it like this to get the eigenvectors matching the eigenvalue 1:
M = transitionMatrix( G )
leftEigenvectors = scipy.linalg.eig(M, left=True, right=False)
unitEigenvaluePositions = findPositions(leftEigenvectors[0], 1.000)
steadyStateVectors = []
for i in unitEigenvaluePositions:
    thisEigenvector = leftEigenvectors[1][:, i]
    thisEigenvector = thisEigenvector / sum(thisEigenvector)
    steadyStateVectors.append(thisEigenvector)
print(steadyStateVectors)
But actually this doesn't work. There is one eigenvalue = 1.00000000e+00 +0.00000000e+00j that is not found even though two others are.
My expectation is that I am not the first person to use python to find stationary distributions of Markov models. Somebody who is more proficient/experienced probably has a working general solution (whether using numpy or scipy or not). Considering how popular Markov models are I expected there to be a library just for them to perform this task, and maybe it does exist but I couldn't find one.

You linked to How do I find out eigenvectors corresponding to a particular eigenvalue of a matrix? and said it doesn't compute the left eigenvector, but you can fix that by working with the transpose.
For example,
In [901]: import numpy as np
In [902]: import scipy.sparse.linalg as sla
In [903]: M = np.array([[0.5, 0.25, 0.25, 0], [0, 0.1, 0.9, 0], [0.2, 0.7, 0, 0.1], [0.2, 0.3, 0, 0.5]])
In [904]: M
Out[904]:
array([[ 0.5 , 0.25, 0.25, 0. ],
[ 0. , 0.1 , 0.9 , 0. ],
[ 0.2 , 0.7 , 0. , 0.1 ],
[ 0.2 , 0.3 , 0. , 0.5 ]])
In [905]: eval, evec = sla.eigs(M.T, k=1, which='LM')
In [906]: eval
Out[906]: array([ 1.+0.j])
In [907]: evec
Out[907]:
array([[-0.32168797+0.j],
[-0.65529032+0.j],
[-0.67018328+0.j],
[-0.13403666+0.j]])
In [908]: np.dot(evec.T, M).T
Out[908]:
array([[-0.32168797+0.j],
[-0.65529032+0.j],
[-0.67018328+0.j],
[-0.13403666+0.j]])
To normalize the eigenvector (which you know should be real):
In [913]: u = (evec/evec.sum()).real
In [914]: u
Out[914]:
array([[ 0.18060201],
[ 0.36789298],
[ 0.37625418],
[ 0.07525084]])
In [915]: np.dot(u.T, M).T
Out[915]:
array([[ 0.18060201],
[ 0.36789298],
[ 0.37625418],
[ 0.07525084]])
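The session above can be collected into a short, self-contained script (the variable names here are mine): the left eigenvector of M for eigenvalue 1 is the right eigenvector of M.T, and normalizing by the sum gives the stationary distribution.

```python
import numpy as np
import scipy.sparse.linalg as sla

M = np.array([[0.5, 0.25, 0.25, 0.0],
              [0.0, 0.1,  0.9,  0.0],
              [0.2, 0.7,  0.0,  0.1],
              [0.2, 0.3,  0.0,  0.5]])

# largest-magnitude eigenpair of the transpose == left eigenpair of M
evals, evecs = sla.eigs(M.T, k=1, which='LM')

# normalize so the entries sum to 1; the result is known to be real
u = (evecs[:, 0] / evecs[:, 0].sum()).real
```

The check that u is stationary is simply that u @ M reproduces u.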
If you don't know the multiplicity of eigenvalue 1 in advance, see @pv.'s comment showing code using scipy.linalg.eig. Here's an example:
In [984]: M
Out[984]:
array([[ 0.9 , 0.1 , 0. , 0. , 0. , 0. ],
[ 0.3 , 0.7 , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0.25, 0.75, 0. , 0. ],
[ 0. , 0. , 0.5 , 0.5 , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 1. ],
[ 0. , 0. , 0. , 0. , 1. , 0. ]])
In [985]: import scipy.linalg as la
In [986]: evals, lvecs = la.eig(M, right=False, left=True)
In [987]: tol = 1e-15
In [988]: mask = abs(evals - 1) < tol
In [989]: evals = evals[mask]
In [990]: evals
Out[990]: array([ 1.+0.j, 1.+0.j, 1.+0.j])
In [991]: lvecs = lvecs[:, mask]
In [992]: lvecs
Out[992]:
array([[ 0.9486833 , 0. , 0. ],
[ 0.31622777, 0. , 0. ],
[ 0. , -0.5547002 , 0. ],
[ 0. , -0.83205029, 0. ],
[ 0. , 0. , 0.70710678],
[ 0. , 0. , 0.70710678]])
In [993]: u = lvecs/lvecs.sum(axis=0, keepdims=True)
In [994]: u
Out[994]:
array([[ 0.75, -0. , 0. ],
[ 0.25, -0. , 0. ],
[ 0. , 0.4 , 0. ],
[ 0. , 0.6 , 0. ],
[ 0. , -0. , 0.5 ],
[ 0. , -0. , 0.5 ]])
In [995]: np.dot(u.T, M).T
Out[995]:
array([[ 0.75, 0. , 0. ],
[ 0.25, 0. , 0. ],
[ 0. , 0.4 , 0. ],
[ 0. , 0.6 , 0. ],
[ 0. , 0. , 0.5 ],
[ 0. , 0. , 0.5 ]])

All right, I had to make some changes while implementing Warren's solution, and I've included those below. It's basically the same, so he gets all the credit, but the realities of numerical approximation with numpy and scipy required more massaging, which I thought would be helpful for others trying to do this in the future. I also changed the variable names to be super noob-friendly.
Please let me know if I got anything wrong or there are further recommended improvements (e.g. for speed).
# in this case my Markov model is a weighted directed graph, so convert that nx.Graph (G) into its transition matrix
M = transitionMatrix( G )
# create a list of the left eigenvalues and a separate array of the left eigenvectors
theEigenvalues, leftEigenvectors = scipy.linalg.eig(M, right=False, left=True)
# for the stationary distributions the eigenvalues and vectors are real, and dropping the imaginary part speeds things up a bit
theEigenvalues = theEigenvalues.real
leftEigenvectors = leftEigenvectors.real
# set how close to 1 is acceptable as being 1...1e-15 was too tight to catch one of the actual eigenvalues
tolerance = 1e-10
# create a filter to collect the eigenvalues that are near enough to one
mask = abs(theEigenvalues - 1) < tolerance
# apply that filter
theEigenvalues = theEigenvalues[mask]
# keep only the eigenvectors whose eigenvalue is (numerically) 1
leftEigenvectors = leftEigenvectors[:, mask]
# convert all the tiny and negative values to zero to isolate the actual stationary distributions
leftEigenvectors[leftEigenvectors < tolerance] = 0
# normalize each distribution by the sum of the eigenvector columns
attractorDistributions = leftEigenvectors / leftEigenvectors.sum(axis=0, keepdims=True)
# this checks that the vectors are actually left eigenvectors, but it isn't needed for normal usage
#attractorDistributions = np.dot(attractorDistributions.T, M).T
# convert the column vectors into row vectors (lists) for each attractor (the standard output for this kind of analysis)
attractorDistributions = attractorDistributions.T
# a list of the states in any attractor with the approximate stationary distribution within THAT attractor (e.g. for graph coloring)
theSteadyStates = np.sum(attractorDistributions, axis=0)
Putting that all together in an easy copy-and-paste format:
M = transitionMatrix( G )
theEigenvalues, leftEigenvectors = scipy.linalg.eig(M, right=False, left=True)
theEigenvalues = theEigenvalues.real
leftEigenvectors = leftEigenvectors.real
tolerance = 1e-10
mask = abs(theEigenvalues - 1) < tolerance
theEigenvalues = theEigenvalues[mask]
leftEigenvectors = leftEigenvectors[:, mask]
leftEigenvectors[leftEigenvectors < tolerance] = 0
attractorDistributions = leftEigenvectors / leftEigenvectors.sum(axis=0, keepdims=True)
attractorDistributions = attractorDistributions.T
theSteadyStates = np.sum(attractorDistributions, axis=0)
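As a sanity check of the copy-and-paste block, here is a minimal sketch with a stand-in matrix (transitionMatrix(G) isn't shown in the post, so the reducible chain below is made up): every extracted column u should satisfy u.T @ M = u.T and sum to 1.

```python
import numpy as np
import scipy.linalg

# a reducible chain with two closed communicating classes,
# standing in for transitionMatrix(G)
M = np.array([[0.75, 0.25, 0.0, 0.0],
              [0.25, 0.75, 0.0, 0.0],
              [0.0,  0.0,  0.4, 0.6],
              [0.0,  0.0,  0.6, 0.4]])

theEigenvalues, leftEigenvectors = scipy.linalg.eig(M, right=False, left=True)
mask = abs(theEigenvalues.real - 1) < 1e-10
u = leftEigenvectors[:, mask].real
u = u / u.sum(axis=0, keepdims=True)
# each column of u is now one stationary distribution (one per attractor)
```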
Using this analysis on a generated Markov model produced one attractor (of three) with a steady state distribution of 0.19835218 and 0.80164782, compared to the mathematically accurate values of 0.2 and 0.8. So that's more than 0.1% off, which is a fairly big error for science. That's not a real problem, because if accuracy is important then, now that the individual attractors have been identified, a more accurate analysis of the behavior within each attractor can be done using a matrix subset.

Related

How to analyze all traces in a graph analyzed with minimum spanning trees?

The minimum spanning tree (MST) algorithm finds the shortest possible set of connections joining all nodes. In Python, SciPy provides a package to compute the MST. I want to find a way to get the main traces based on the connections. In the example below, there is a trace at dest2-dest4-dest1 and another one at source-dest4-dest3 (other combinations of traces are possible as well).
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import distance_matrix
import pandas as pd
import numpy as np
The example has the following coordinate points for source and destinations
df = pd.DataFrame([[2, 2], [30, 2], [2, 30], [25, 25], [14,10]], columns=['xcord', 'ycord'], index=['source', 'dest1', 'dest2', 'dest3', 'dest4'])
A distance matrix and minimum spanning tree with resulting distances can be computed.
dm = pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)
mst = minimum_spanning_tree(dm)
arr_distances = mst.toarray()
This gives the following array.
array([[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[14.4222051 , 17.88854382, 23.32380758, 18.60107524, 0. ]])
This means distance source-dest4 = 14.42, dest1-dest4 = 17.89, dest2-dest4 = 23.32, and dest3-dest4 = 18.60.
How can one capture the available main trace(s), expressed as a list or numpy.array like ['source', 'dest4', 'dest3'] and ['dest2', 'dest4', 'dest1']? This should work for any configuration of input coordinates.
So far I have tried to get the main connections using np.argwhere(arr_distances > 0), but somehow it is difficult to get them as the aforementioned traces in different lists or sublists.
It would be good if the solution were in Python, but any package other than SciPy is welcome too. Thank you very much in advance for your help!
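Building on the np.argwhere attempt, one direction (a sketch, not a general solution; the array below is hard-coded from the MST output in the question) is to turn the non-zero entries into labelled edges, then emit a leaf-hub-leaf trace through every node with two or more neighbours:

```python
import numpy as np
from collections import defaultdict

# stand-in for the MST output in the question (only the dest4 row is non-zero)
labels = ['source', 'dest1', 'dest2', 'dest3', 'dest4']
arr_distances = np.zeros((5, 5))
arr_distances[4] = [14.4222051, 17.88854382, 23.32380758, 18.60107524, 0.0]

# every non-zero entry of the MST array is an edge of the tree
edges = [(labels[i], labels[j]) for i, j in np.argwhere(arr_distances > 0)]

# build an undirected adjacency list, then pair up the neighbours of each hub
adj = defaultdict(list)
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)
traces = []
for hub, nbrs in adj.items():
    for i in range(len(nbrs)):
        for j in range(i + 1, len(nbrs)):
            traces.append([nbrs[i], hub, nbrs[j]])
```

This only covers three-node traces through a single hub; longer chains would need a proper tree traversal.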

How to constrict your problem to allow a custom-defined "cool-off period"?

I am using CVXPY and CVXOPT to solve a problem that lets us control the running of motors in the cheapest possible way.
We have existing constraints consisting of cost, volume, and level, which need to be maintained (all of which work; this is only to give further definition of the problem at hand). We use these to minimize the cost and provide an optimized approach.
Recently we wished to add a new constraint that allows for a cool-off period, which we can set. So if the selection value is 1, the next two periods should be 0, if that makes sense. Essentially, if the pump runs for a period, it should be off for the next two to cool down.
Please see below for the current function being used. No cool-off constraint has been implemented yet, and I wish to get a push in the right direction on how best to implement such a solution.
def optimiser(self, cost_, volume_, v_min, flow_, min_level, max_level,
              initial_level, period_lengths, out_flow_, errors):
    """
    This function optimizes the possible regime combinations using convex
    optimization. We assign and define the problem and use `GLPK_MI` to solve it.

    Parameters
    ----------
    cost_
        Numpy array
    volume_
        Numpy array
    v_min
        Float -> minimum volume
    flow_
        Numpy array
    min_level
        Float -> minimum level
    max_level
        Float -> maximum level
    initial_level
        Float -> initial level
    period_lengths
        Integer -> number of period lengths
    out_flow_
        Numpy array
    errors
        Exceptions -> error handling

    Returns
    -------
    selection
        Solved problem with the best possible combination for the pumping regime.
    """
    FACTOR = 0.001 * 1800 * self.RESERVOIR_VOLUME
    input_flow_matrix = np.zeros((max(period_lengths), len(period_lengths)))
    for i, l in enumerate(period_lengths):
        input_flow_matrix[:l, i] = 1
    selection = cp.Variable(shape=cost_.shape, boolean=True)
    assignment_constraint = cp.sum(selection, axis=1) == 1
    input_flow_ = cp.sum(cp.multiply(flow_, selection), axis=1)
    input_flow_vector = cp.vec(cp.multiply(
        input_flow_matrix,
        np.ones((max(period_lengths), 1)) @ cp.reshape(input_flow_, (1, len(period_lengths)))))
    res_flow = input_flow_vector - cp.vec(out_flow_)
    res_level = cp.cumsum(res_flow) * FACTOR + initial_level
    volume_ = cp.sum(cp.multiply(volume_, selection))
    volume_constraint = volume_ >= v_min
    min_level_constraint = res_level >= min_level
    max_level_constraint = res_level <= max_level
    if errors is LevelTooLowError:
        constraints = [assignment_constraint, max_level_constraint, volume_constraint]
    elif errors is LevelTooHighError:
        constraints = [assignment_constraint, min_level_constraint, volume_constraint]
    elif errors is MaxVolumeExceededError:
        constraints = [assignment_constraint, min_level_constraint, volume_constraint]
    else:
        constraints = [assignment_constraint, max_level_constraint,
                       min_level_constraint, volume_constraint]
    cost_ = cp.sum(cp.multiply(cost_, selection))
    assign_prob = cp.Problem(cp.Minimize(cost_), constraints)
    assign_prob.solve(solver=cp.GLPK_MI, verbose=False)
    return selection
Sample of data which we work with
Here is our cost_ array
[[0. 0.8 ]
[0. 0.8 ]
[0. 0.8 ]
[0. 0.8 ]
[0. 0.8 ]
[0. 0.8 ]
[0. 0.8 ]
[0. 0.8 ]
[0. 0.9 ]
[0. 0.9 ]
[0. 0.9 ]
[0. 0.9 ]
[0. 0.9 ]
[0. 0.9 ]
[0. 0.9 ]
[0. 0.9 ]
[0. 1.35]
[0. 1.35]
[0. 1.35]
[0. 1.8 ]
[0. 1.2 ]]
And here is our volume_ array
[[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 34200.]
[ 0. 51300.]
[ 0. 51300.]
[ 0. 51300.]
[ 0. 68400.]
[ 0. 51300.]]
I hope my problem has been defined well enough for anyone to understand what I wish to do. If anything else is needed, I can happily provide it.
Thanks in advance.
Something like this is sometimes used in power generation. A generator may have to cool down before it is allowed to be turned on again.
Let's assume we have a binary variable x[t] indicating whether the generator is "on" at period t, i.e.
x[t] = 1 if the machine is "on"
     = 0 if it is "off"
Then we want to require:
if x[t-1]=1 and x[t]=0 => x[t+1] = 0
This reads as: if the machine was just turned off at t (on at t-1, off at t), then it must also stay off at t+1. This can be written as a linear constraint:
x[t+1] <= x[t] - x[t-1] + 1 for all periods t
If we read "if the pump runs for a period it should be off for the next two to cool down" as indicating that in addition, the maximum up-time is 1, then we can simply write:
x[t-1] + x[t] + x[t+1] <= 1 for all t
This will allow patterns like [0,0,0],[1,0,0],[0,1,0],[0,0,1] but forbids patterns like [1,1,1],[0,1,1],[1,1,0],[1,0,1].
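The window-sum constraint can be sanity-checked on small patterns with plain numpy before wiring it into the optimizer (a sketch; x here is a hypothetical per-period on/off schedule, not the selection matrix from the question):

```python
import numpy as np

def cooloff_ok(x):
    """Check x[t-1] + x[t] + x[t+1] <= 1 for all interior periods t
    of a binary on/off schedule."""
    x = np.asarray(x)
    return all(x[t - 1] + x[t] + x[t + 1] <= 1 for t in range(1, len(x) - 1))
```

In CVXPY the same idea would be added as one constraint per interior period, e.g. `constraints += [x[t-1] + x[t] + x[t+1] <= 1 for t in range(1, T-1)]` on a boolean variable x derived from the pump's on/off status.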

Finite HMM's and their Digamma matrix

I've been staring at my textbook for 3 days, as well as supplementary Matlab code, and I'm simply not grasping how the digamma matrix looks when the transition matrix has an exit state.
Usually, digamma would have T-1 elements, and each element would be a matrix with the dimensions of the transition matrix (right? digamma_t_ij)... so, say my toy example is
A = [[0.9, 0.1, 0.0],
[0.0, 0.9, 0.1]]
x = [-0.2, 2.6, 1.3]
The emission probabilities for state 1 and 2 are modelled by gaussian distributions. I'm disregarding r for the moment, i.e. several sequences.
In my mind digamma should have the shape np.zeros((sequence_length, self.nStates, self.nStates + 1)), where the +1 is for the exit state...
Although, here's an excerpt from my textbook:
"We therefore supplement the ξr matrix with one additional column". How... what... I feel so stupid :(
This is what I'm getting atm
Digamma
array([[[0.06495786, 0.93504214, 0. ],
[0. , 0. , 0. ]],
[[0. , 0.06495786, 0. ],
[0. , 0.93504214, 0. ]],
[[0. , 0. , 0. ],
[0. , 0. , 0. ]]])
I'm guessing the first two elements are right, as the values in each matrix add up to 1, so changing their last column just seems wrong... but filling in the last matrix doesn't seem right either?
I'll add all the variables needed to calculate digamma, in case someone can explain this to me and would prefer not to come up with a toy example:
# a_pass from the Forward Algorithm
a_pass = array([[1. , 0.38470424, 0.41887466],
[0. , 0.61529576, 0.58112534]])
# b_pass from the Backward Algorithm
b_pass = array([[1. , 1.03893571, 0. ],
[8.41537925, 9.35042138, 2.08182773]])
# scale variables derived from Forward
c_scale = array([1. , 0.16252347, 0.82658096, 0.05811253])
# Transition matrix
A = array([[0.9, 0.1, 0. ],
[0. , 0.9, 0.1]])
# Emission matrix, calculated as getting the pdf for each x given a gaussian dist
B = array([[1. , 0.06947052, 1. ],
[0.14182701, 1. , 0.81107303]])
# Observation sequence
x = [-0.2, 2.6, 1.3]
# The normalised gamma
gamma = array([[1. , 0.06495786, 0. ],
[0. , 0.93504214, 1. ]])

How to store a distance matrix more efficiently?

I have this Python code to calculate coordinate distances among different points. The input file (Spots.csv) looks like this:
IDs,X,Y,Z
0-20,193.722,175.733,0.0998975
0-21,192.895,176.727,0.0998975
7-22,187.065,178.285,0.0998975
0-23,192.296,178.648,0.0998975
7-24,189.421,179.012,0.0998975
8-25,179.755,179.347,0.0998975
8-26,180.436,179.288,0.0998975
7-27,186.453,179.2,0.0998975
8-28,178.899,180.92,0.0998975
The code works perfectly, but as the number of coordinates I now have is very large (~50000), I need to optimise this code; otherwise it is impossible to run. Could someone suggest a more memory-efficient way of doing this? Thanks for any suggestions.
#!/usr/bin/env python
import pandas as pd
import scipy.spatial as spsp
df_1 =pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
df_1['dist'] = distances.tolist()
# CREATES columns d0, d1, d2, d3
dist_cols = df_1['IDs']
df_1[dist_cols] = df_1['dist'].apply(pd.Series)
df_1.to_csv("results_Spots.csv")
There are a couple of ways to save space. The first is to only store the upper triangle of your matrix and make sure that your indices always reflect that. The second is only to store the values that meet your threshold. This can be done collectively by using sparse matrices, which support most of the operations you will likely need, and will only store the elements you need.
To store half the data, preprocess your indices when you access your matrix A. So access index [i, j] like this:
def getitem(A, i, j):
    if i > j:
        i, j = j, i
    return A[i, j]
scipy.sparse supports a number of sparse matrix formats: BSR, Coordinate, CSR, CSC, Diagonal, DOK, LIL. According to the usage reference, the easiest way to construct a matrix is using DOK or LIL format. I will show the latter for simplicity, although the former may be more efficient. I will leave it up to the reader to benchmark different options once a basic functioning approach has been shown. Remember to convert to CSR or CSC format when doing matrix math.
We will sacrifice speed for spatial efficiency by constructing one row at a time:
import numpy as np
import scipy.sparse

N = coords.shape[0]
threshold = 2
threshold2 = threshold**2  # minor optimization to save on some square roots
distances = scipy.sparse.lil_matrix((N, N))
for i in range(N):
    # Compute squared distances to all later points
    d2 = np.sum(np.square(coords[i + 1:, :] - coords[i]), axis=1)
    # Threshold
    mask = np.flatnonzero(d2 <= threshold2)
    # Apply, only computing the square root where necessary
    distances[i, mask + i + 1] = np.sqrt(d2[mask])
For your toy example, we find that there are only four elements that actually pass threshold, making the storage very efficient:
>>> distances.nnz
4
>>> distances.toarray()
array([[0. , 1.29304486, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 1.1008038 , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0.68355102, 0. , 1.79082802],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
Using the result from scipy.spatial.distance_matrix confirms that these numbers are in fact accurate.
If you want to fill the matrix (effectively doubling the storage, which should not be prohibitive), you should probably move away from LIL format before doing so. Simply add the transpose to the original matrix to fill it out.
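For instance, a minimal sketch of that fill-out step (the matrix entries here are made up): convert to CSR first, then add the transpose to mirror the upper triangle across the diagonal.

```python
import scipy.sparse as sp

# hypothetical upper-triangular distance matrix with two stored entries
upper = sp.lil_matrix((3, 3))
upper[0, 1] = 1.5
upper[1, 2] = 2.0

# leave LIL format before doing arithmetic, then mirror across the diagonal
half = upper.tocsr()
full = half + half.T
```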
The approach shown here addresses your storage concerns, but you can improve the efficiency of the entire computation using spatial sorting and other geospatial techniques. For example, you could use scipy.spatial.KDTree or the similar scipy.spatial.cKDTree to arrange your dataset and query neighbors within a specific threshold directly and efficiently.
For example, the following would replace the matrix construction shown here with what is likely a more efficient method:
tree = scipy.spatial.KDTree(coords)
distances = tree.sparse_distance_matrix(tree, threshold)
You are asking in your code for point-to-point distances in a ~50000 x ~50000 matrix. The result will be very big if you really want to store it. The matrix is dense, as each point has a positive distance to every other point.
I recommend revisiting your business requirements. Do you really need to calculate all these distances upfront and store them in a file on disk? Sometimes it is better to do the required calculations on the fly; scipy.spatial is fast, perhaps not much slower than reading a precalculated value.
EDIT (based on comment):
You can filter the calculated distances by a threshold (here, for illustration, 5.0) and then look up the IDs in the DataFrame:
import numpy as np
import pandas as pd
import scipy.spatial as spsp

df_1 = pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
adj_5 = np.argwhere(distances[:] < 5.0)
pd.DataFrame(zip(df_1['IDs'][adj_5[:, 0]].values,
                 df_1['IDs'][adj_5[:, 1]].values),
             columns=['from', 'to'])

KNN model returns same distances with any k

I am trying to create a basic item-based recommender system with kNN. But with the following code it always returns the same distances for different k's. Why does it return the same results?
import pandas as pd
import scipy.sparse
from sklearn.neighbors import NearestNeighbors

df_ratings = pd.read_csv('ml-1m/ratings.dat', names=["user_id", "movie_id", "rating", "timestamp"],
                         header=None, sep='::', engine='python')
matrix_df = df_ratings.pivot(index='movie_id', columns='user_id', values='rating').fillna(0).astype(bool).astype(int)
um_matrix = scipy.sparse.csr_matrix(matrix_df.values)
# knn model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=17, n_jobs=-1)
model_knn.fit(um_matrix)
# `movie` is the id of the query movie
distances, indices = model_knn.kneighbors(um_matrix[int(movie)], n_neighbors=100)
Your model returns the same distances for any k because k does not change the distances between your data points.
k-nearest neighbours simply finds the nearest neighbours of a point in your feature space; k specifies how many of them you are looking for, not how far they are from each other.
A simple example would be
X = [[0,0],[0,5],[5,0],[5,5],[4,4]]
Plotted as a scatter plot, these points form a square with one extra point just inside the [5,5] corner,
so your distance matrix defines the (sorted) distances from each point to all the others:
[0,0]: [0.        , 5.        , 5.        , 5.65685425, 7.07106781]
[0,5]: [0.        , 4.12310563, 5.        , 5.        , 7.07106781]
[5,0]: [0.        , 4.12310563, 5.        , 5.        , 7.07106781]
[5,5]: [0.        , 1.41421356, 5.        , 5.        , 7.07106781]
[4,4]: [0.        , 1.41421356, 4.12310563, 4.12310563, 5.65685425]
The first row shows the distances from point [0,0] to every other point, sorted nearest first:
to itself it is 0
to [0,5] the distance is 5
to [5,0] the distance is 5
to [4,4] it is (in my case, the Euclidean distance) the square root of
4*4 + 4*4, so 5.65...
to [5,5] the Euclidean distance is 7.07106781
So no matter how many points you are looking for (k), the distances are always the same.
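This can be verified with plain numpy on the example points above (a sketch; sklearn's kneighbors does the same sort-and-truncate internally): k only truncates the sorted distance list, it never changes the values.

```python
import numpy as np

X = np.array([[0, 0], [0, 5], [5, 0], [5, 5], [4, 4]])

# full pairwise Euclidean distance matrix via broadcasting
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

order = np.argsort(D[0])   # neighbours of [0,0], nearest first
k3 = D[0][order][:3]       # "k=3" result
k5 = D[0][order][:5]       # "k=5" result: same values, just more of them
```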
