Parameter eps of DBSCAN, python

I have a set of points. Their geometry (SRID: 4326) is stored in a database.
I have been given code that aims to cluster these points with DBSCAN. The parameters have been set as follows: eps=1000, min_points=1.
I obtain clusters that are less than 1000 meters apart. I believed that any two points less than 1000 meters apart would belong to the same cluster. Is epsilon really in meters?
The code is the following:
self.algorithm = 'DBSCAN'
X = self.data[:, [2, 3]]                                 # the two coordinate columns
if self.debug:
    print 'Nbr of Points: %d' % len(X)
# pairwise Euclidean distances between all points
D = distance.squareform(distance.pdist(X, 'euclidean'))
db = DBSCAN(eps, min_samples).fit(D)
self.core_samples = db.core_sample_indices_
self.labels = db.labels_
The aim is not to find another way to run it, but to understand the value of eps and what it represents in terms of distance. min_samples is set to one because I do accept clusters containing a single sample.

This depends on your implementation.
Your distance function could return anything: meters, millimeters, yards, km, miles, degrees... but you did not share which distance function you use!
If I'm not mistaken, SRID: 4326 does not imply anything about how distances are computed.
The "haversine" used by sklearn seems to use degrees, not meters.
Either way, min_points=1 is nonsensical: the query point is included in its own neighborhood, so every point is a core point and forms a cluster on its own. With min_points <= 2 the result of DBSCAN is equivalent to single-linkage clustering; to get a density-based clustering you need to choose a higher value.
You may want to use ELKI's DBSCAN. According to its Java sources, its distance function uses meters, and its R*-tree index allows accelerated range queries with this distance, which yields a substantial speed-up (O(n log n) instead of O(n^2)).
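For what it's worth, here is a minimal sketch (not the asker's code) of how eps can be expressed in meters with scikit-learn: the haversine metric works on radians, so both the coordinates (assumed to be latitude/longitude in degrees) and eps are converted. The sample coordinates and min_samples=4 are illustrative only.
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000.0                   # mean Earth radius in meters
eps_m = 1000.0                               # desired neighborhood radius in meters

coords = np.array([[48.8566, 2.3522],        # hypothetical [lat, lon] points in degrees
                   [48.8570, 2.3530],
                   [45.7640, 4.8357]])

db = DBSCAN(eps=eps_m / EARTH_RADIUS_M,      # meters converted to radians of arc
            min_samples=4,                   # > 1, so clusters reflect actual density
            metric='haversine',
            algorithm='ball_tree').fit(np.radians(coords))
labels = db.labels_                          # -1 marks noise points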

Related

Is k-means++ meant to be perfect every time? What other initialization strategies can yield the best k-means?

I've implemented a k-means algorithm, and performance is highly dependent on how the centroids were initialized. I'm finding random uniform initialization gives a good k-means about 5% of the time, whereas with k-means++ it's closer to 50%. Why is the yield of good k-means so low? I should disclaim that I've only used a handful of data sets, and my good/bad rates are indicative only of those, not of the methods broadly.
Here's an example using k-means++ where the end result was not great. The Dunn Index of this clustering is 0.16.
And an example where it worked perfectly with a Dunn Index of 0.67.
I was maybe under the naive impression k-means++ produced a good k-means every time. Is there perhaps something wrong with my code?
from math import inf
from random import choice

def initialize_centroids(points, k):
    """
    Parameters:
        points : a list of Points.
        k : how many centroids to place.
    Returns:
        A list of centroids.
    """
    clusters = []
    clusters.append(choice(points))      # first centroid is a random point
    for _ in range(k - 1):               # for the other centroids
        distances = []
        for p in points:
            d = inf
            for c in clusters:           # find the minimal distance between p and the chosen centroids
                d = min(d, distance(p, c))
            distances.append(d)
        # pick the point whose distance to its nearest centroid is largest
        clusters.append(points[distances.index(max(distances))])
    return clusters
This is adapted from the algorithm as found on Wikipedia:
Choose one center uniformly at random from among the data points.
For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
Repeat Steps 2 and 3 until k centers have been chosen.
Now that the initial centers have been chosen, proceed using standard k-means clustering.
The difference is that my centroids are chosen deterministically at the furthest distance, rather than sampled with probability proportional to D(x)^2.
My intention is to compare the Dunn Index over different values of k; empirically, a higher Dunn Index means a better clustering. I can't collect (good) data if half of the time it doesn't work, so my results are skewed due to the faultiness of k-means++ or of my implementation thereof.
What other initialization strategies can be employed to get a more consistent result?
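For comparison, here is a minimal sketch of the D(x)^2-weighted seeding described in the Wikipedia steps above; it is not the asker's code, and it assumes points is a NumPy array of shape (n_points, n_dims) with squared Euclidean distance.
import numpy as np

def kmeanspp_init(points, k, seed=None):
    rng = np.random.default_rng(seed)
    n = len(points)
    centroids = [points[rng.integers(n)]]     # step 1: one center uniformly at random
    for _ in range(k - 1):
        # step 2: squared distance from each point to its nearest chosen center
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        # step 3: sample the next center with probability proportional to D(x)^2
        # (assumes the points do not all coincide, so d2.sum() > 0)
        centroids.append(points[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)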

Finding nearest neighbors within a distance and taking the average of those neighbors using cKDTree

I'm using python scripting to read in two large (millions of points) point clouds as arrays ("A" and "B").
I need to find the nearest "B" neighbors of the points in "A", but within 5 cm of each point in "A". I also want to average the neighbors within the 5 cm radius of the points in "A."
Is there a way to do this using cKDTree all at once, including the averaging?
I'm not sure exactly what you want to do, but if I understand you correctly you can follow these steps:
import numpy as np
from scipy.spatial import cKDTree

# these are just random arrays for testing
A = 20 * np.random.rand(1000, 3)
B = 20 * np.random.rand(1000, 3)
Compute a cKDTree for each point cloud
tree_A = cKDTree(A)
tree_B = cKDTree(B)
Find the points in A that are at most 5 units from each point in B:
# faster than loop + query_ball_point
neighbourhood = tree_B.query_ball_tree(tree_A, 5)
Compute the mean over all of those groups of points:
means = np.zeros_like(B)                 # one mean position per point of B
for i, hood in enumerate(neighbourhood):
    if hood:                             # guard against points with no neighbours within 5 units
        means[i] = A[hood].mean(axis=0)
cKDTree does not have any notion of units; I'm hopeful that your measurements are all in the same units (cm) as your desired manipulations.
What do you mean that you want to "average the neighbors"? Is this simply the mean location of all the neighbors within the 5-unit ball?
From what you've posted, I believe that the critical operation for you is
tree_B = cKDTree(B)
for A_point in A:
    hood = tree_B.query_ball_point(A_point, 5)   # indices of B points within 5 units of A_point
Now, just "average" the points in hood. I assume that you know how to do that part; cKDTree doesn't have such an operation, since SciPy and Python supply those on the base types.
You could do this with A as the first argument to query_ball_point, but then you'd get a huge list of neighbourhoods, and perhaps blow your memory limit.
Does that get you moving?
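If it helps, here is a hedged sketch of that averaging step under the same assumptions (A and B are the point-cloud arrays from above; the variable names are illustrative):
import numpy as np
from scipy.spatial import cKDTree

tree_B = cKDTree(B)
means = np.full_like(A, np.nan)              # NaN rows mark A points with no B neighbour
for i, a_point in enumerate(A):
    hood = tree_B.query_ball_point(a_point, 5)
    if hood:                                 # only average non-empty neighbourhoods
        means[i] = B[hood].mean(axis=0)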

How to do calibration accounting for resolution of the instrument

I have to calibrate a distance-measuring instrument that gives capacitance as output. I am able to use numpy polyfit to find a relation and apply it to get distance, but I need to include the limit of detection of 0.0008 m, which is the resolution of the instrument.
My data is:
cal_distance = [.1 , .4 , 1, 1.5, 2, 3]
cal_capacitance = [1971, 2336, 3083, 3720, 4335, 5604]
raw_data = [3044,3040,3039,3036,3033]
I need my distance values to come out like .1008, .4008, reflecting the limit of detection of the instrument.
I have used the following code:
# linear calibration fit: distance = coeffs[0] * capacitance + coeffs[1]
coeffs = np.polyfit(cal_capacitance, cal_distance, 1)
new_distance = []
for i in raw_data:
    d = i * coeffs[0] + coeffs[1]
    new_distance.append(d)
I have a csv file and actually used a pandas dataframe with date time index to store the raw data, but for simplicity I have given a list here.
I need to include the limits of detection in the calibration process to get it right.
The limit of detection is the accuracy of your measurement (the smallest 'step' you can resolve).
polyfit gives you a 'model' of the best fit function f of the relation
distance = f(capacitance)
You use 1 as the degree of the polynomial so you're basically fitting a line.
So, first off, you need to look into the accuracy of the fit: this is returned when you pass full=True to polyfit
(see the docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html for more details).
You will get the residual of the fit.
Is it actually smaller than the LOD? Otherwise your limiting factor is the fitting accuracy. In your particular case it looks like it is 0.00017021, so indeed below the 0.0008 LOD.
Second, why 'add' the LOD to the reading? Your reading is the reading; the LOD is the +/- range the true distance could lie within, so adding it to the end result does not make sense here.
You should instead report the final value as 'new distance' +/- LOD.
Is your raw data all measurements of the same distance? If so, you can see that the standard deviation of this measurement using the fit is 0.0029680362423331122 (numpy.std(new_distance)) and the range is 0.0087759439302268483, which is 10x the LOD, so here your limiting factor really seems to be the measuring conditions.
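For illustration, here is a small sketch (not part of the original answer) that pulls the fit residual out of polyfit with full=True and reports each converted reading as value +/- LOD, using the calibration data from the question:
import numpy as np

cal_distance = [.1, .4, 1, 1.5, 2, 3]
cal_capacitance = [1971, 2336, 3083, 3720, 4335, 5604]
raw_data = [3044, 3040, 3039, 3036, 3033]
LOD = 0.0008                                  # instrument resolution in meters

coeffs, residuals, rank, sv, rcond = np.polyfit(cal_capacitance, cal_distance, 1,
                                                full=True)
print("sum of squared residuals of the fit:", residuals)

new_distance = np.polyval(coeffs, raw_data)   # same linear model as above
for d in new_distance:
    print("distance: %.4f m +/- %.4f m" % (d, LOD))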
Not to beat a dead horse, but LOD and precision are two completely different things. LOD is typically defined as three times the standard deviation of the noise of your instrument, which would be equivalent to the minimum capacitance (or distance, which is related to capacitance here) your instrument can detect; anything less than that is equivalent to zero, more or less. But your precision is the minimum change in capacitance that can be detected by your instrument, which may or may not be less than the LOD. Such terms (in addition to accuracy) are common sources of confusion. While you may know what you are talking about when you say LOD (and everyone else may be able to understand that you really mean precision), it would be beneficial to use the proper notation. Just a thought...

Problems in performing K means clustering

I am trying to cluster the following data from a CSV file with K means clustering.
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
It is basically a graph where the Samples are nodes and the numbers are the edge weights.
I read the file as following:
import csv

fileopening = open('data.csv', 'rU')
reading = csv.reader(fileopening, delimiter=',')
L = list(reading)
I used this code: https://gist.github.com/betzerra/8744068
Here clusters are built based on the following:
num_points, dim, k, cutoff, lower, upper = 10, 2, 3, 0.5, 0, 200
points = map(lambda i: makeRandomPoint(dim, lower, upper), range(num_points))
clusters = kmeans(points, k, cutoff)
for i, c in enumerate(clusters):
    for p in c.points:
        print " Cluster: ", i, "\t Point :", p
I replaced points with the list L, but I got lots of errors: AttributeError, 'int' object has no attribute 'n', and so on.
I need to perform k-means clustering based on the third column (the edge weights) of my CSV file. The gist creates points randomly, and I am not sure how to use my CSV data as input to its kmeans function instead. How can I perform k-means (k=2) on my data?
In short "you can't".
Long answer:
K-means is defined for euclidean spaces only and it requires a valid points positions, while you only have distances between them, probably not in a strict mathematical sense but rather some kind of "similarity". K-means is not designed to work with similarity matrices.
What you can do?
You can use some other method to embeed your points in euclidean space in such a way, that they closely reasamble your distances, one of such tools is Multidimensional scaling (MDS): http://en.wikipedia.org/wiki/Multidimensional_scaling
Once point 1 is done you can run k-means
Alternatively you can also construct a kernel (valid in a Mercer's sense) by performing some kernel learning techniques to reasamble your data and then run kernel k-means on the resulting Gram matrix.
As lejlot said, distances between points alone are not enough to run k-means in the classic sense. It's easy to understand if you understand the nature of k-means. At a high level, k-means works as follows:
1) Randomly assign points to clusters.
(Technically, there are more sophisticated ways of initial partitioning,
but that's not essential right now.)
2) Compute the centroids of the clusters.
(This is where you need the actual coordinates of the points.)
3) Reassign each point to the cluster with the closest centroid.
4) Repeat steps 2)-3) until a stopping condition is met.
So, as you can see, in the classic interpretation k-means will not work, because it is unclear how to compute centroids. However, I have several suggestions for what you could do.
Suggestion 1.
Embed your points in N-dimensional space, where N is the number of points, so that the coordinates of each point are its distances to all the other points.
For example the data you showed:
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
becomes:
Sample1: (0,45,69,12,...)
Sample2: (78,46,0,0,...)
Then you can legitimately use Euclidean distance. Note that the actual distances between points will not be preserved, but this can be a simple and reasonable approximation that preserves the relative distances between the points. Another disadvantage is that if you have a lot of points, your memory (and running time) requirements will be on the order of N^2.
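As a rough sketch of this suggestion (illustrative only; it assumes L is the list of [source, target, weight] rows read from the CSV, and leaves missing edges at 0):
import numpy as np
from sklearn.cluster import KMeans

nodes = sorted({row[0] for row in L} | {row[1] for row in L})
index = {name: i for i, name in enumerate(nodes)}

D = np.zeros((len(nodes), len(nodes)))        # row i = "coordinates" of node i
for src, dst, w in L:
    D[index[src], index[dst]] = float(w)

labels = KMeans(n_clusters=2, n_init=10).fit_predict(D)
for name, label in zip(nodes, labels):
    print(name, "-> cluster", label)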
Suggestion 2.
Instead of k-means, try k-medoids. For this you do not need the actual coordinates of the points, because instead of a centroid you compute a medoid. The medoid of a cluster is a point from that cluster which has the smallest average distance to all the other points in the cluster. You could look for implementations online, or it's actually pretty easy to implement yourself; a rough sketch is shown below. The running time is proportional to N^2 as well.
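Here is a compact, hedged sketch of such a k-medoids loop on a precomputed distance matrix D of shape N x N; it is only an illustration, not a tuned implementation.
import numpy as np

def k_medoids(D, k, max_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)          # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                # the member with the smallest total distance to the others becomes the medoid
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return labels, medoids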
Final remark.
Why do you want to use k-means at all? It seems like you have a weighted directed graph, and there are clustering algorithms specifically intended for graphs. This is beyond the scope of your question, but maybe it is something worth considering?

How to compute the shannon entropy and mutual information of N variables

I need to compute the mutual information, and therefore the Shannon entropy, of N variables.
I wrote code that computes the Shannon entropy of a given distribution.
Let's say that I have a variable x, an array of numbers.
Following the definition of Shannon entropy I need to compute the normalized probability density function, which is easy to get with numpy.histogram.
import scipy.integrate as scint
from numpy import *
from scipy import *

def shannon_entropy(a, bins):
    p, binedg = histogram(a, bins, normed=True)
    p = p / len(p)
    x = binedg[:-1]
    g = -p * log2(p)
    g[isnan(g)] = 0.
    return scint.simps(g, x=x)
By choosing the input x and the bin number carefully, this function works.
But it is very dependent on the bin number: different values of this parameter give different results.
In particular, if my input is an array of constant values:
x = [0, 0, 0, ...., 0, 0, 0]
the entropy of this variable obviously has to be 0. If I choose a bin number equal to 1, I get the right answer; if I choose other values, I get strange, nonsensical (negative) answers. My feeling is the following: numpy.histogram has the arguments normed=True or density=True which (as the official documentation says) return a normalized histogram, and I probably make an error at the point where I switch from the probability density function (the output of numpy.histogram) to the probability mass function (the input of the Shannon entropy), where I do:
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
I would like to find a way to solve these problems; ideally, an efficient method to compute the Shannon entropy that is independent of the bin number.
I also wrote a function to compute the Shannon entropy of a distribution over several variables, but I get the same kind of error.
The code is below; the input of shannon_entropydd is an array in which each position holds one of the variables to be involved in the statistical computation:
def intNd(c, axes):
    assert len(c.shape) == len(axes)
    assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
    if len(axes) == 1:
        return scint.simps(c, axes[0])
    else:
        return intNd(scint.simps(c, axes[-1]), axes[:-1])

def shannon_entropydd(c, bins=30):
    hist, ax = histogramdd(c, bins, normed=True)
    for i in range(len(ax)):
        ax[i] = ax[i][:-1]
    p = -hist * log2(hist)
    p[isnan(p)] = 0
    return intNd(p, ax)
I need these quantities in order to compute the mutual information between certain sets of variables:
M_info(x,y,z) = H(x) + H(y) + H(z) - H(x,y,z)
where H(x) is the Shannon entropy of the variable x.
I have to find a way to compute these quantities, so if someone has a completely different kind of code that works, I can switch to it; I don't need to repair this code, but to find a correct way to compute these statistical functions!
The result will depend pretty strongly on the estimated density. Can you assume a specific form for the density? You can reduce the dependence of the result on the estimate if you avoid histograms or other general-purpose estimates such as kernel density estimates. If you can give more detail about the variables involved, I can make more specific comments.
I worked with estimates of mutual information as part of the work for my dissertation [1]. There is some stuff about MI in section 8.1 and appendix F.
[1] http://riso.sourceforge.net/docs/dodier-dissertation.pdf
I think that if you choose bins = 1 you will always find an entropy of 0, as there is no "uncertainty" over which bin the values are in ("uncertainty" is what entropy measures). You should choose a number of bins "big enough" to account for the diversity of the values your variable can take. If you have discrete values: for binary values you should use bins >= 2; if the values your variable can take are in {0,1,2}, you should have bins >= 3; and so on...
I must say that I did not read your code, but this works for me:
import numpy as np

x = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
bins = 10
cx = np.histogram(x, bins)[0]

def entropy(c):
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    h = -sum(c_normalized * np.log(c_normalized))
    return h

hx = entropy(cx)
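Building on that, here is a hedged sketch (not part of the original answer) of the mutual information M_info(x,y) = H(x) + H(y) - H(x,y) computed from the same kind of counts, generalized to several variables via numpy.histogramdd; y and the bin counts are illustrative, and entropy() and x are the ones defined above.
def joint_entropy(variables, bins=10):
    # joint counts over all variables, flattened and fed to entropy() above
    counts, _ = np.histogramdd(np.column_stack(variables), bins=bins)
    return entropy(counts.ravel())

def mutual_information(variables, bins=10):
    # M(x1, ..., xn) = sum_i H(x_i) - H(x1, ..., xn)
    marginal = sum(entropy(np.histogram(v, bins)[0]) for v in variables)
    return marginal - joint_entropy(variables, bins)

y = [0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0]      # illustrative second variable
print(mutual_information([x, y], bins=2))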
