Is a Fuzzy C-Means algorithm available for Python? - python

I have some points in 3-dimensional space and would like to cluster them. I know Python's module "cluster", but it has only K-Means. Do you know a module which has FCM (Fuzzy C-Means)?
(If you know some other Python modules related to clustering, you could name them as a bonus. But the important question is the one about an FCM algorithm in Python.)
Matlab
It seems to be quite easy to use FCM in Matlab (example). Isn't something like this available for Python?
NumPy, SciPy and Sage
I didn't find FCM in NumPy, SciPy, or Sage. I downloaded the documentation and searched it, with no results.
Python-cluster
It seems like the cluster module will add fuzzy C-Means with the next version (see the Roadmap), but I need it now.

PEACH will provide some Fuzzy C-Means functionality:
http://code.google.com/p/peach/
However, there doesn't seem to be any usable documentation, as the wiki is empty. An example of using FCM with PEACH can be found on its website.

Have a look at the scikit-fuzzy package. It has very basic fuzzy logic functionality, including fuzzy c-means clustering.
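A minimal sketch of how it can be used for the 3-D points in the question (assuming the current skfuzzy API; note that cmeans expects data of shape (n_features, n_samples), so an (n, 3) array of points has to be transposed):

import numpy as np
import skfuzzy as fuzz

points = np.random.rand(100, 3)  # stand-in for your (n, 3) array of 3-D points
cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
    points.T, c=5, m=2.0, error=0.005, maxiter=1000)

hard_labels = np.argmax(u, axis=0)  # most likely cluster for each point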

Python
There is a fuzzy-c-means package on PyPI. Check out the link: fuzzy-c-means Python
This is the simplest way to use FCM in Python. Hope it helps.
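A hedged sketch of its scikit-learn-style interface (the import name is fcmeans; check the project page for the exact, current API):

import numpy as np
from fcmeans import FCM

X = np.random.rand(100, 3)         # stand-in for your (n, 3) array of 3-D points
fcm = FCM(n_clusters=5)
fcm.fit(X)

centers = fcm.centers              # cluster centers, shape (5, 3)
hard_labels = fcm.predict(X)       # hard assignment for each point
memberships = fcm.soft_predict(X)  # fuzzy membership matrix, shape (100, 5)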

I have done it from scratch, using K++ initialization (with fixed seeds and 5 centroids; it shouldn't be too difficult to adapt it to your desired number of centroids):
# K++ initialization algorithm:
import random
import numpy as np  # np replaces the removed scipy.array / scipy.rand aliases

def initialize(X, K):
    C = [X[0]]
    for k in range(1, K):
        D2 = np.array([min([np.inner(c - x, c - x) for c in C]) for x in X])
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        np.random.seed(20)  # fixing seeds
        #random.seed(0)  # fixing seeds
        r = np.random.rand()
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

a = initialize(data2, 5)  # "a" is the initial centroids array; I used 5 centroids (data2 is your (n, 3) array of points)
# Now the fuzzy c-means algorithm:
from numpy import linalg as LA

m = 1.5  # fuzziness parameter (it can be tuned)
r = 2 / (m - 1)

# Initial centroids:
c1, c2, c3, c4, c5 = a[0], a[1], a[2], a[3], a[4]
# Prepare empty lists to collect the centroid history:
cc1, cc2, cc3, cc4, cc5 = [], [], [], [], []

n_iterations = 10000
for j in range(n_iterations):
    u1, u2, u3, u4, u5 = [], [], [], [], []
    for i in range(len(data2)):
        # Distances of every point to each centroid:
        a = LA.norm(data2[i] - c1)
        b = LA.norm(data2[i] - c2)
        c = LA.norm(data2[i] - c3)
        d = LA.norm(data2[i] - c4)
        e = LA.norm(data2[i] - c5)
        # Membership values of point i with respect to each centroid:
        U1 = 1 / (1 + (a/b)**r + (a/c)**r + (a/d)**r + (a/e)**r)
        U2 = 1 / ((b/a)**r + 1 + (b/c)**r + (b/d)**r + (b/e)**r)
        U3 = 1 / ((c/a)**r + (c/b)**r + 1 + (c/d)**r + (c/e)**r)
        U4 = 1 / ((d/a)**r + (d/b)**r + (d/c)**r + 1 + (d/e)**r)
        U5 = 1 / ((e/a)**r + (e/b)**r + (e/c)**r + (e/d)**r + 1)
        # This builds an array of n row points x K centroids, with their degrees of membership:
        u1.append(U1)
        u2.append(U2)
        u3.append(U3)
        u4.append(U4)
        u5.append(U5)
    # Now we calculate the new centers (standard FCM weights the update by u**m):
    c1 = (np.array(u1)**m).dot(data2) / np.sum(np.array(u1)**m)
    c2 = (np.array(u2)**m).dot(data2) / np.sum(np.array(u2)**m)
    c3 = (np.array(u3)**m).dot(data2) / np.sum(np.array(u3)**m)
    c4 = (np.array(u4)**m).dot(data2) / np.sum(np.array(u4)**m)
    c5 = (np.array(u5)**m).dot(data2) / np.sum(np.array(u5)**m)
    cc1.append(c1)
    cc2.append(c2)
    cc3.append(c3)
    cc4.append(c4)
    cc5.append(c5)
    # Stop when the centroids barely move over the last few iterations:
    if j > 5:
        change_rate1 = np.sum(3*cc1[j] - cc1[j-1] - cc1[j-2] - cc1[j-3]) / 3
        change_rate2 = np.sum(3*cc2[j] - cc2[j-1] - cc2[j-2] - cc2[j-3]) / 3
        change_rate3 = np.sum(3*cc3[j] - cc3[j-1] - cc3[j-2] - cc3[j-3]) / 3
        change_rate4 = np.sum(3*cc4[j] - cc4[j-1] - cc4[j-2] - cc4[j-3]) / 3
        change_rate5 = np.sum(3*cc5[j] - cc5[j-1] - cc5[j-2] - cc5[j-3]) / 3
        change_rate = np.array([change_rate1, change_rate2, change_rate3,
                                change_rate4, change_rate5])
        changed = np.sum(np.abs(change_rate) > 1e-7)  # abs() so negative drift counts too
        if changed == 0:
            break

print(c1)  # check the centroid coordinates c1 - c5; they are the last centroids calculated, so supposedly they converged
U = np.array([u1, u2, u3, u4, u5]).T  # membership matrix: n row points x K centroid columns
print(U)  # the degree of membership of each point to each centroid
I know it is not very pythonic, but I hope it can be a starting point for your complete fuzzy c-means algorithm. I think "soft clustering" is the way to go when data are not easily separable (for example, when a t-SNE visualization shows all the data together instead of clearly separated groups; in that case, forcing every point to belong strictly to only one cluster can be dangerous). I would try m = 1.1 up to m = 2.0, so you can see how the fuzziness parameter affects the membership matrix.

Related

How to compute the Topological Overlap Measure [TOM] for a weighted adjacency matrix in Python?

I'm trying to calculate the weighted topological overlap for an adjacency matrix but I cannot figure out how to do it correctly using numpy. The R function that does the correct implementation is from WGCNA (https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity). The formula for computing this (I THINK) is detailed in equation 4, which I believe is correctly reproduced below.
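For reference, my reading of equation 4 (the unsigned TOM with the "min" denominator, matching the R call below) is:

w_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij),  with  l_ij = sum_{u != i,j} a_iu * a_uj  and  k_i = sum_{u != i} a_iu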
Does anyone know how to implement this correctly so it reflects the WGCNA version?
Yes, I know about rpy2 but I'm trying to go lightweight on this if possible.
For starters, my diagonal is not 1 and the values have no consistent error from the original (e.g. not all off by x).
When I computed this in R, I used the following:
> library(WGCNA, quiet=TRUE)
> df_adj = read.csv("https://pastebin.com/raw/sbAZQsE6", row.names=1, header=TRUE, check.names=FALSE, sep="\t")
> df_tom = TOMsimilarity(as.matrix(df_adj), TOMType="unsigned", TOMDenom="min")
# ..connectivity..
# ..matrix multiplication (system BLAS)..
# ..normalization..
# ..done.
# I've uploaded it to this url: https://pastebin.com/raw/HT2gBaZC
I'm not sure where my code is incorrect. The source code for the R version is here, but it's using C backend scripts, which are very difficult for me to interpret.
Here is my implementation in Python:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

def get_iris_data():
    iris = load_iris()
    # Iris dataset
    X = pd.DataFrame(iris.data,
                     index=[*map(lambda x: f"iris_{x}", range(150))],
                     columns=[*map(lambda x: x.split(" (cm)")[0].replace(" ", "_"), iris.feature_names)])
    y = pd.Series(iris.target,
                  index=X.index,
                  name="Species")
    return X, y

# Get data
X, y = get_iris_data()

# Create an adjacency network
# df_adj = np.abs(X.T.corr()) # I've uploaded this part to this url: https://pastebin.com/raw/sbAZQsE6
df_adj = pd.read_csv("https://pastebin.com/raw/sbAZQsE6", sep="\t", index_col=0)
A_adj = df_adj.values

# Correct TOM from WGCNA for the A_adj
# See above for code
# https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity
df_tom__wgcna = pd.read_csv("https://pastebin.com/raw/HT2gBaZC", sep="\t", index_col=0)

# My attempt
A = A_adj.copy()
dimensions = A.shape
assert dimensions[0] == dimensions[1]
d = dimensions[0]

# np.fill_diagonal(A, 0)

# Equation (4) from http://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf
A_tom = np.zeros_like(A)
for i in range(d):
    a_iu = A[i]
    k_i = a_iu.sum()
    for j in range(i+1, d):
        a_ju = A[:, j]
        k_j = a_ju.sum()
        l_ij = np.dot(a_iu, a_ju)
        a_ij = A[i, j]
        numerator = l_ij + a_ij
        denominator = min(k_i, k_j) + 1 - a_ij
        w_ij = numerator / denominator
        A_tom[i, j] = w_ij

A_tom = (A_tom + A_tom.T)
There is a package called GTOM (https://github.com/benmaier/gtom) but it is not for weighted adjacencies. The author of GTOM also took a look at this problem (with a much more sophisticated/efficient NumPy implementation, but it's still not producing the expected results).
Does anyone know how to reproduce the WGCNA implementation?
EDIT: 2019.06.20
I've adapted some of the code from #scleronomic and #benmaier, with credits in the docstring. The function is available in soothsayer from v2016.06 onward. Hopefully this will make it easier for people to use topological overlap in Python instead of only being able to use R.
https://github.com/jolespin/soothsayer/blob/master/soothsayer/networks/networks.py
import numpy as np
import soothsayer as sy
df_adj = sy.io.read_dataframe("https://pastebin.com/raw/sbAZQsE6")
df_tom = sy.networks.topological_overlap_measure(df_adj)
df_tom__wgcna = sy.io.read_dataframe("https://pastebin.com/raw/HT2gBaZC")
np.allclose(df_tom, df_tom__wgcna)
# True
First let's look at the parts of the equation for the case of a binary adjacency matrix a_ij:
a_ij: indicates if node i is connected to node j
k_i: count of the neighbors of node i (connectivity)
l_ij: count of the common neighbors of node i and node j
so w_ij measures how many of the neighbors of the node with the lower connectivity are also neighbors of the other node (i.e. w_ij measures "their relative inter-connectedness").
My guess is that they define the diagonal of A to be zero instead of one.
With this assumption I can reproduce the values of WGCNA.
A[range(d), range(d)] = 0  # Assumption
L = A @ A  # Could be done smarter by using the symmetry
K = A.sum(axis=1)

A_tom = np.zeros_like(A)
for i in range(d):
    for j in range(i+1, d):
        numerator = L[i, j] + A[i, j]
        denominator = min(K[i], K[j]) + 1 - A[i, j]
        A_tom[i, j] = numerator / denominator

A_tom += A_tom.T
A_tom[range(d), range(d)] = 1  # Set diagonal to 1 by default

A_tom__wgcna = np.array(pd.read_csv("https://pastebin.com/raw/HT2gBaZC",
                                    sep="\t", index_col=0))
print(np.allclose(A_tom, A_tom__wgcna))
An intuition why the diagonal of A should be zero instead of one can be seen for a simple example with a binary A:
  Graph            Case Zero          Case One

    B               A B C D            A B C D
   / \           A  0 1 1 1         A  1 1 1 1
A-----D          B  1 0 0 1         B  1 1 0 1
   \ /           C  1 0 0 1         C  1 0 1 1
    C            D  1 1 1 0         D  1 1 1 1
The given description of equation 4 explains:
Note that w_ij = 1 if the node with fewer connections satisfies two conditions:
(a) all of its neighbors are also neighbors of the other node and
(b) it is connected to the other node.
In contrast, w_ij = 0 if i and j are un-connected and the two nodes do not share any neighbors.
So the connection between A-D should fulfill this criterion and be w_14=1.
Case Zero diagonal / Case One diagonal: [the computed TOM matrices for the two cases are omitted here]
What is still missing when applying the formula is that the diagonal values don't match. I set them to one by default. What is the inter-connectedness of a node with itself anyway? A value different than one (or zero, depending on definition) doesn't make sense to me.
Neither Case Zero nor Case One result in w_ii=1 in the simple example.
In Case Zero it would be necessary that k_i+1 == l_ii, and in Case One it would be necessary that k_i == l_ii+1, which both seems wrong to me.
So to summarize I would set the diagonal of the adjacency matrix to zero, use the given equation and set the diagonal of the result to one by default.
Given an adjacency matrix A, it's possible to calculate the TOM matrix W without for loops, which speeds up the process tremendously:
L = np.dot(A, A)
k = np.sum(A, axis=0)
d = len(k)
tile = np.tile(k, (d, 1))
K = np.min(np.stack((tile, tile.T), axis=2), axis=2)  # pairwise min(k_i, k_j)
W = (L + A) / (K + 1 - A)
np.fill_diagonal(W, 1)
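Wrapped up as a function for reuse (a convenience sketch of the snippet above; np.minimum.outer is an equivalent way to build the pairwise-minimum matrix K, and a symmetric A with zero diagonal is assumed, as discussed earlier):

import numpy as np

def tom(A):
    # vectorized unsigned TOM; A symmetric with zero diagonal
    L = np.dot(A, A)
    k = A.sum(axis=0)
    K = np.minimum.outer(k, k)  # pairwise min(k_i, k_j)
    W = (L + A) / (K + 1 - A)
    np.fill_diagonal(W, 1)
    return W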

Is my problem suited for convex optimization, and if so, how to express it with cvxpy?

I have an array of scalars of m rows and n columns. I have a Variable(m) and a Variable(n) that I would like to find solutions for.
The two variables represent values that need to be broadcast over the columns and rows respectively.
I was naively thinking of writing the variables as Variable((m, 1)) and Variable((1, n)), and adding them together as if they're ndarrays. However, that doesn't work, as broadcasting is not allowed.
import cvxpy as cp
import numpy as np
# Problem data.
m = 3
n = 4
np.random.seed(1)
data = np.random.randn(m, n)
# Construct the problem.
x = cp.Variable((m, 1))
y = cp.Variable((1, n))
objective = cp.Minimize(cp.sum(cp.abs(x + y + data)))
# or:
#objective = cp.Minimize(cp.sum_squares(x + y + data))
prob = cp.Problem(objective)
result = prob.solve()
print(x.value)
print(y.value)
This fails on the x + y expression: ValueError: Cannot broadcast dimensions (3, 1) (1, 4).
Now I'm wondering two things:
Is my problem indeed solvable using convex optimization?
If yes, how can I express it in a way that cvxpy understands?
I'm very new to the concept of convex optimization, as well as cvxpy, and I hope I described my problem well enough.
I offered to show you how to represent this as a linear program, so here goes. I'm using Pyomo, since I'm more familiar with it, but you could do something similar in PuLP.
To run this, you will need to first install Pyomo and a linear program solver like glpk. glpk should work for reasonable-sized problems, but if you are finding it's taking too long to solve, you could try a (much faster) commercial solver like CPLEX or Gurobi.
You can install Pyomo via pip install pyomo or conda install -c conda-forge pyomo. You can install glpk from https://www.gnu.org/software/glpk/ or via conda install glpk. (I think PuLP comes with a version of glpk built-in, so that might save you a step.)
Here's the script. Note that this calculates absolute error as a linear expression by defining one variable for the positive component of the error and another for the negative part. Then it seeks to minimize the sum of both. In this case, the solver will always set one to zero since that's an easy way to reduce the error, and then the other will be equal to the absolute error.
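In equation form (just restating that trick, using the same names as the variables in the script):

minimize    sum over (r,c) of  ErrUp[r,c] + ErrDown[r,c]
subject to  X[r] + Y[c] - data[r,c] = ErrUp[r,c] - ErrDown[r,c]
            ErrUp[r,c] >= 0,  ErrDown[r,c] >= 0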
import random
import pyomo.environ as po

random.seed(1)

# ~50% sparse data set, big enough to populate every row and column
m = 10  # number of rows
n = 10  # number of cols
data = {
    (r, c): random.random()
    for r in range(m)
    for c in range(n)
    if random.random() >= 0.5
}

# define a linear program to find vectors
# x in R^m, y in R^n, such that x[r] + y[c] is close to data[r, c]

# create an optimization model object
model = po.ConcreteModel()

# create indexes for the rows and columns
model.ROWS = po.Set(initialize=range(m))
model.COLS = po.Set(initialize=range(n))

# create indexes for the dataset
model.DATAPOINTS = po.Set(dimen=2, initialize=data.keys())

# data values
model.data = po.Param(model.DATAPOINTS, initialize=data)

# create the x and y vectors
model.X = po.Var(model.ROWS, within=po.NonNegativeReals)
model.Y = po.Var(model.COLS, within=po.NonNegativeReals)

# create dummy variables to represent errors
model.ErrUp = po.Var(model.DATAPOINTS, within=po.NonNegativeReals)
model.ErrDown = po.Var(model.DATAPOINTS, within=po.NonNegativeReals)

# Force the error variables to match the error
def Calculate_Error_rule(model, r, c):
    pred = model.X[r] + model.Y[c]
    err = model.ErrUp[r, c] - model.ErrDown[r, c]
    return (model.data[r, c] + err == pred)
model.Calculate_Error = po.Constraint(
    model.DATAPOINTS, rule=Calculate_Error_rule
)

# Minimize the total error
def ClosestMatch_rule(model):
    return sum(
        model.ErrUp[r, c] + model.ErrDown[r, c]
        for (r, c) in model.DATAPOINTS
    )
model.ClosestMatch = po.Objective(
    rule=ClosestMatch_rule, sense=po.minimize
)

# Solve the model

# get a solver object
opt = po.SolverFactory("glpk")
# solve the model
# turn off "tee" if you want less verbose output
results = opt.solve(model, tee=True)

# show solution status
print(results)

# show verbose description of the model
model.pprint()

# show X and Y values in the solution
for r in model.ROWS:
    print('X[{}]: {}'.format(r, po.value(model.X[r])))
for c in model.COLS:
    print('Y[{}]: {}'.format(c, po.value(model.Y[c])))
Just to complete the story, here's a solution that's closer to your original example. It uses cvxpy, but with the sparse data approach from my solution.
I don't know the "official" way to do elementwise calculations with cvxpy, but it seems to work OK to just use the standard Python sum function with a lot of individual cp.abs(...) calculations.
This gives a solution that is very slightly worse than the linear program, but you may be able to fix that by adjusting the solution tolerance.
import cvxpy as cp
import random

random.seed(1)

# Problem data.
# ~50% sparse data set
m = 10  # number of rows
n = 10  # number of cols
data = {
    (i, j): random.random()
    for i in range(m)
    for j in range(n)
    if random.random() >= 0.5
}

# Construct the problem.
x = cp.Variable(m)
y = cp.Variable(n)
objective = cp.Minimize(
    sum(
        cp.abs(x[i] + y[j] + data[i, j])
        for (i, j) in data.keys()
    )
)
prob = cp.Problem(objective)
result = prob.solve()
print(x.value)
print(y.value)
I didn't fully get the idea, so here is just some hacky stuff based on the assumption that:
you want a cvxpy equivalent of numpy's broadcasting behaviour on arrays of shapes (m, 1) + (1, n)
So numpy-wise:
m = 3
n = 4
np.random.seed(1)
a = np.random.randn(m, 1)
b = np.random.randn(1, n)
a
array([[ 1.62434536],
[-0.61175641],
[-0.52817175]])
b
array([[-1.07296862, 0.86540763, -2.3015387 , 1.74481176]])
a + b
array([[ 0.55137674, 2.48975299, -0.67719333, 3.36915713],
[-1.68472504, 0.25365122, -2.91329511, 1.13305535],
[-1.60114037, 0.33723588, -2.82971045, 1.21664001]])
Let's mimic this with np.kron, which has a cvxpy-equivalent:
aLifted = np.kron(np.ones((1,n)), a)
bLifted = np.kron(np.ones((m,1)), b)
aLifted
array([[ 1.62434536, 1.62434536, 1.62434536, 1.62434536],
[-0.61175641, -0.61175641, -0.61175641, -0.61175641],
[-0.52817175, -0.52817175, -0.52817175, -0.52817175]])
bLifted
array([[-1.07296862, 0.86540763, -2.3015387 , 1.74481176],
[-1.07296862, 0.86540763, -2.3015387 , 1.74481176],
[-1.07296862, 0.86540763, -2.3015387 , 1.74481176]])
aLifted + bLifted
array([[ 0.55137674, 2.48975299, -0.67719333, 3.36915713],
[-1.68472504, 0.25365122, -2.91329511, 1.13305535],
[-1.60114037, 0.33723588, -2.82971045, 1.21664001]])
Let's check cvxpy semi-blindly (we only check dimensions; too lazy to set up a problem and fix the variables to check the output :-D):
import cvxpy as cp
x = cp.Variable((m, 1))
y = cp.Variable((1, n))
cp.kron(np.ones((1,n)), x) + cp.kron(np.ones((m, 1)), y)
# Expression(AFFINE, UNKNOWN, (3, 4))
# looks good!
Now some caveats:
I don't know how efficiently cvxpy can reason about this matrix form internally.
It's unclear whether this is more efficient than a simple list-comprehension-based form using cp.vstack and co (it probably is).
This operation itself kills all sparsity: if both vectors are dense, your matrix is dense.
cvxpy and more or less all convex-optimization solvers are based on some sparsity assumption, so scaling this problem up to machine-learning dimensions will not make you happy.
There is probably a much more concise mathematical theory for your problem than (sparsity-assuming, pretty general) convex optimization (DCP, as implemented in cvxpy, is a subset).

How to get interpolated array values in numpy / scipy

I was wondering how I could get the interpolated value of a 3D array. I am trying to get the value at, for example, position (1.4, 2.3, 4.2) of a 3D array. How can I get the interpolated value?
counterX = 1.5
counterY = 1.5
counterZ = 1.5
for x in range(0, length):
    for y in range(0, length):
        for z in range(0, length):
            value = img[counterX, counterY, counterZ]
        counterZ = 0
    counterY = 0
counterX, counterY and counterZ are float values rather than integers. However, I cannot simply cast them with int(...) since my results need to be very exact. Therefore I thought interpolation would be the best solution.
Just go for trilinear Interpolation as described here:
https://en.wikipedia.org/wiki/Trilinear_interpolation
For your example point (1.4, 2.3, 4.2) this would be (writing img[x, y, z] for the array value at grid point (x, y, z)):
C00 = img[1,2,4]*0.6 + img[2,2,4]*0.4
C01 = img[1,3,4]*0.6 + img[2,3,4]*0.4
C10 = img[1,2,5]*0.6 + img[2,2,5]*0.4
C11 = img[1,3,5]*0.6 + img[2,3,5]*0.4
C0 = C00*0.8 + C10*0.2
C1 = C01*0.8 + C11*0.2
C = C0*0.7 + C1*0.3
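If you'd rather not hand-roll this, scipy can do the same lookup for you; a sketch (map_coordinates with order=1 performs exactly this kind of linear interpolation):

import numpy as np
from scipy.ndimage import map_coordinates

img = np.random.rand(10, 10, 10)          # stand-in for your 3-D array
coords = np.array([[1.4], [2.3], [4.2]])  # one row per axis, one column per query point
value = map_coordinates(img, coords, order=1)[0]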
I am not sure what exactly your problem is.
Would you like to create an interpolated array from some observed values? Then I would personally recommend a kriging model; pyKriging seems to do that, but I have never used it personally.
You could then create a function (using the prediction model built through kriging) taking the three arguments counterX, counterY and counterZ, and just evaluate the prediction at any position.

Calculating medoid of a cluster (Python)

So I'm running a KNN in order to create clusters. From each cluster, I would like to obtain the medoid of the cluster.
I'm employing a fractional distance metric in order to calculate distances:
δ(x, y) = ( sum_{i=1..d} |x^i - y^i|^f )^(1/f)
where d is the number of dimensions, the first data point's coordinates are x^i, the second data point's coordinates are y^i, and f is an arbitrary number between 0 and 1.
I would then calculate the medoid as:
medoid(S) = argmin_{y in S} sum_{x in S} δ(x, y)
where S is the set of data points, and δ is the absolute value of the distance metric used above.
I've looked online to no avail trying to find implementations of the medoid (even with other distance metrics), but most things were specifically k-means or k-medoids, which [I think] is relatively different from what I want.
Essentially this boils down to me being unable to translate the math into effective programming. Any help or pointers in the right direction would be much appreciated! Here's a short list of what I have so far:
I have figured out how to calculate the fractional distance metric (the first equation) so I think I'm good there.
I know numpy has an argmin() function (documented here).
Extra points for increased efficiency without loss of accuracy (I'm trying not to brute-force it by calculating every single pairwise fractional distance, because the number of point pairs might lead to factorial complexity...).
compute the pairwise distance matrix
compute the column or row sums
argmin to find the medoid index
i.e. numpy.argmin(distMatrix.sum(axis=0)) or similar; a sketch follows below.
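A minimal sketch of that recipe with the fractional metric from the question (f is assumed to be the fractional exponent; note that the broadcasting builds an n x n x d array, so this is still brute force):

import numpy as np

def medoid_fractional(X, f=0.3):
    # pairwise fractional distances: (sum_i |x_i - y_i|^f)^(1/f)
    D = np.abs(X[:, None, :] - X[None, :, :]) ** f
    D = D.sum(axis=-1) ** (1.0 / f)
    # the medoid is the point minimizing the sum of distances to all others
    return X[np.argmin(D.sum(axis=0))]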
So I've accepted the answer here, but I thought I'd provide my implementation if anyone else was trying to do something similar:
(1) This is the distance function:
import numpy as np

def fractional(p_coord_array, q_coord_array):
    # f is an arbitrary value, but must be greater than zero and
    # less than one. In this case, I used 3/10. I took advantage
    # of the difference of cubes in this case, so that I wouldn't
    # encounter an overflow error.
    a = np.sum(np.array(p_coord_array, dtype=np.float64))
    b = np.sum(np.array(q_coord_array, dtype=np.float64))
    a2 = a ** 2   # a^2, ab and b^2, so that
    ab = a * b    # diffab * suma2abb2 == a^3 - b^3
    b2 = b ** 2   # (difference of cubes)
    diffab = a - b
    suma2abb2 = a2 + ab + b2
    temp_dist = abs(diffab * suma2abb2)
    temp_dist = np.power(temp_dist, 1. / 10)
    dist = np.power(temp_dist, 10. / 3)
    return dist
(2) The medoid function (if the length of the dataset was less than 6000 [if greater than that, I ran into overflow errors... I'm still working on that bit to be perfectly honest...]):
def medoid(dataset):
    point = []
    w = len(dataset)
    if len(dataset) < 6000:
        h = len(dataset)
        dist_matrix = [[0 for x in range(w)] for y in range(h)]
        list_combinations = [(counter_1, counter_2, data_1, data_2)
                             for counter_1, data_1 in enumerate(dataset)
                             for counter_2, data_2 in enumerate(dataset)
                             if counter_1 < counter_2]
        for counter_3, pair in enumerate(list_combinations):
            temp_dist = fractional(pair[2], pair[3])
            dist_matrix[pair[0]][pair[1]] = abs(temp_dist)
            dist_matrix[pair[1]][pair[0]] = abs(temp_dist)
        # pick the point with the smallest distance sum (per the accepted answer)
        point = dataset[np.argmin(np.sum(dist_matrix, axis=0))]
    return point
Any questions, feel free to comment!
If you don't mind using brute force this might help:
def calc_medoid(X, Y, f=2):
    n = len(X)
    m = len(Y)
    dist_mat = np.zeros((m, n))
    # compute distance matrix
    for j in range(n):
        center = X[j, :]
        for i in range(m):
            if i != j:
                dist_mat[i, j] = np.linalg.norm(Y[i, :] - center, ord=f)
    medoid_id = np.argmin(dist_mat.sum(axis=0))  # sum over y
    return medoid_id, X[medoid_id, :]
Here is an example of computing a medoid for a single cluster with Euclidean distance.
import numpy as np, pandas as pd, matplotlib.pyplot as plt

a, b, c, d = np.array([0, 1]), np.array([1, 3]), np.array([4, 2]), np.array([3, 1.5])
vCentroid = np.mean([a, b, c, d], axis=0)

def GetMedoid(vX):
    vMean = np.mean(vX, axis=0)                              # compute centroid
    return vX[np.argmin([sum((x - vMean)**2) for x in vX])]  # pick the point closest to the centroid

vMedoid = GetMedoid([a, b, c, d])

print(f'centroid = {vCentroid}')
print(f'medoid = {vMedoid}')

df = pd.DataFrame([a, b, c, d], columns=['x', 'y'])
ax = df.plot.scatter('x', 'y', grid=True, title='Centroid in 2D plane', s=100);
plt.plot(vCentroid[0], vCentroid[1], 'ro', ms=10);  # plot centroid as a red circle
plt.plot(vMedoid[0], vMedoid[1], 'rx', ms=20);      # plot medoid as a red cross
You can also use the following package to compute the medoid for one or more clusters:
!pip -q install scikit-learn-extra > log
from sklearn_extra.cluster import KMedoids
GetMedoid = lambda vX: KMedoids(n_clusters=1).fit(vX).cluster_centers_
GetMedoid([a, b, c, d])[0]
I would say that you just need to compute the median.
np.median(np.asarray(points), axis=0)
Your median is the point with the biggest centrality.
Note: if you are using distances different than Euclidean this doesn't hold.

Converting MATLAB's interp1 to Python interp1d

I'm converting MATLAB code into Python code.
The code uses the function interp1 in MATLAB. I found that the scipy function interp1d should be what I'm after, but I'm not sure. Could you tell me if the code I implemented is correct?
My Python version is 3.4.1 and the MATLAB version is R2013a; however, the code was originally implemented around 2010.
MATLAB:
S_T = [0.0, 2.181716948, 4.363766232, 6.546480392, 8.730192373, ...
10.91523573, 13.10194482, 15.29065504, 17.48170299, 19.67542671, ...
21.87216588, 24.07226205, 26.27605882, 28.48390208; ...
1.0, 1.000382662968538, 1.0020234819906781, 1.0040560245904753, ...
1.0055690037530718, 1.0046180687475195, 1.000824223678225, ...
0.9954866694014762, 0.9891408937764872, 0.9822543350571298, ...
0.97480163751874, 0.9666158376141503, 0.9571711322843011, ...
0.9460998105962408; ...
1.0, 0.9992731388936672, 0.9995093132493109, 0.9997021748479805, ...
0.9982835412406582, 0.9926319477117723, 0.9833685776596993, ...
0.9730725288209638, 0.9626092685176822, 0.9525234896714959, ...
0.9426698515488858, 0.9326788630704709, 0.9218100196936996, ...
0.9095717918978693];
S = transpose(S_T);
dist = 0.00137;
old = 15.61;
ll = 125;
ref = 250;
start = 225;
high = 7500;
low = 2;
U = zeros(low,low,high);
for ii=1:high
    g0 = start - ref*dist*ii;
    g1 = g0 + ll;
    if (g0 <= 0.0 && g1 >= 0.0)
        temp = old/2*(1-cos(2*pi*g0/ll));
        for jj=1:low
            U(jj,jj,ii) = temp;
        end
    end
end
for ii=1:low
    S_mod(ii,1,:) = interp1(S(:,1), S(:,ii+1), U(ii,ii,:), 'linear');
end
Python:
import numpy
import os
from scipy import interpolate

S = [[0.0, 2.181716948, 4.363766232, 6.546480392, 8.730192373, 10.91523573, 13.10194482, 15.29065504,
      17.48170299, 19.67542671, 21.87216588, 24.07226205, 26.27605882, 28.48390208],
     [1.0, 1.000382662968538, 1.0020234819906781, 1.0040560245904753, 1.0055690037530718, 1.0046180687475195,
      1.000824223678225, 0.9954866694014762, 0.9891408937764872, 0.9822543350571298, 0.97480163751874,
      0.9666158376141503, 0.9571711322843011, 0.9460998105962408],
     [1.0, 0.9992731388936672, 0.9995093132493109, 0.9997021748479805, 0.9982835412406582, 0.9926319477117723,
      0.9833685776596993, 0.9730725288209638, 0.9626092685176822, 0.9525234896714959, 0.9426698515488858,
      0.9326788630704709, 0.9218100196936996, 0.9095717918978693]]

dist = 0.00137
old = 15.61
ll = 125
ref = 250
start = 225
high = 7500
low = 2

U = [numpy.zeros([low, low]) for _ in range(high)]
for ii in range(high):
    g0 = start - ref * dist * (ii+1)
    g1 = g0 + ll
    if g0 <= 0.0 and g1 >= 0.0:
        for jj in range(low):
            U[ii][jj, jj] = old / 2 * (1 - numpy.cos(2 * numpy.pi * g0 / ll))

S_mod = []
for jj in range(high):
    temp = []
    for ii in range(low):
        temp.append(interpolate.interp1d(S[0], S[ii+1], U[jj][ii, ii]))
    S_mod.append(temp)
OK, so I've solved my own problem (thanks to Alex's explanation of MATLAB's interp1!).
The python interp1d doesn't have query points in itself, but instead creates a function which you then use to get your new data points. Thus, it should be:
f = interpolate.interp1d( S[0], S[ii+1])
temp.append(f(U[jj][ii,ii]))
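Putting it together, the corrected inner loop from my code above becomes:

S_mod = []
for jj in range(high):
    temp = []
    for ii in range(low):
        f = interpolate.interp1d(S[0], S[ii+1])  # build the interpolant once
        temp.append(f(U[jj][ii, ii]))            # then evaluate it at the query point
    S_mod.append(temp)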
There is a Python library that lets you use MATLAB functions through wrappers: mlabwrap. If you don't need to change the code of the functions themselves, this could save you some time.
I don't know scipy, but I can tell you what the interp1 call in MATLAB is doing:
http://www.mathworks.com/help/matlab/ref/interp1.html
You are using the syntax:
vq = interp1(x,v,xq,method)
"Vector x contains the sample points, and v contains the corresponding values, v(x). Vector xq contains the coordinates of the query points."
So, in your code, S(:,1) contains the sample points where your grid is defined, S(:,ii+1) contains your sampled values for your 1-D function, and U(ii,ii,:) contains the query points where you want to interpolate to find new functional values between known values in your grid. You are using linear interpolation.
1-D interpolation is an extremely well defined operation, and interp1 is a relatively straightforward interface for this operation. What exactly do you not understand? Are you clear what interpolation is?
Essentially, you have a discretely defined function f[x]; the first argument to interp1 is x, the second argument is f[x], and the third argument is the set of arbitrarily defined query points Xq at which you want to find new function values f[Xq]. Since these values are not known, you have to use an interpolation method to approximate f[Xq]. 'linear' means you will use a linear weighted average of the two known sample neighbors (left and right) nearest to Xq.
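For completeness, a one-call NumPy equivalent of vq = interp1(x, v, xq, 'linear') would be something like the sketch below (note the argument order differs from MATLAB, and np.interp clamps to the edge values outside the sample range instead of returning NaN):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])      # sample points (must be increasing)
v = np.array([0.0, 10.0, 20.0, 30.0])   # sampled values v(x)
xq = np.array([0.5, 1.5, 2.25])         # query points

vq = np.interp(xq, x, v)  # linear interpolation, like interp1(x, v, xq, 'linear')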
