I am working with two-dimensional data:
X = array([[5.40310335, 0. ],
[6.86136114, 6.56225717],
[0. , 0. ],
...,
[5.88838732, 0. ],
[6.0003473 , 0. ],
[6.25971331, 0. ]])
Looking for clusters using Euclidean distance, I run affinity propagation from scikit-learn on this raw data as follows:
af = AffinityPropagation(damping=.9, max_iter=300, random_state=0).fit(X)
obtaining 9 clusters as a result.
I understand that when you want to use another distance you have to pass in the negative distance matrix and use affinity='precomputed', as follows:
af_c = AffinityPropagation(damping=.9, max_iter=300,
                           affinity='precomputed', random_state=0).fit(distM)
As distM I use the negative Euclidean distance matrix, calculated as follows:
import numpy as np
from scipy.spatial import distance_matrix

distM = -distance_matrix(X, X)
np.fill_diagonal(distM, np.median(distM))
filling the diagonal with the median, since the median is also the default preference value in the method.
Using this I am getting 34 clusters as a result, and I would expect to get 9, as when working with the default distance. I don't know if I'm interpreting the way of entering the distance matrix correctly, or if the library does something different when everything is left at its defaults.
I would appreciate any help.
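For reference, a minimal runnable sketch of both paths. One detail worth checking: per the scikit-learn documentation, the default affinity='euclidean' uses the negative squared Euclidean distance, so the precomputed matrix has to be squared to reproduce the default behaviour.

import numpy as np
from scipy.spatial import distance_matrix
from sklearn.cluster import AffinityPropagation

# X is the raw data array from the question.
af = AffinityPropagation(damping=.9, max_iter=300, random_state=0).fit(X)

# Negative SQUARED Euclidean distances match the default affinity; the
# preference (diagonal) defaults to the median of the similarities either way.
distM = -distance_matrix(X, X) ** 2
af_c = AffinityPropagation(damping=.9, max_iter=300,
                           affinity='precomputed', random_state=0).fit(distM)
# Both runs should now produce the same labels.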
A random vector sampled from the Dirichlet distribution contains values that fall in the domain [0,1] and they sum to 1. In numpy it can be programmed like this for a vector size of 5:
x = np.random.dirichlet(np.ones(5))
Instead, I would like a random vector with values in [-1, 1] that sum to 1, which I was told can be achieved by transforming the Dirichlet-generated vector x as y = 2x - 1.
Below is an attempt at this transformation. However, the script doesn't work properly, because y doesn't sum to 1 as needed. How can it be fixed, or could it be that y = 2x - 1 does not do what they said?
import numpy as np

x = np.random.dirichlet(np.ones(5))
y = 2*x - 1
print(x, np.sum(x))
print(y, np.sum(y))
which outputs:
[0.0209344 0.44791586 0.21002354 0.04107336 0.28005284] 1.0
[-0.9581312 -0.10416828 -0.57995291 -0.91785327 -0.43989433] -3.0000000000000004
The thing is that the interval [0, 1] has one and only one increasing linear map onto the interval [-1, 1], which is exactly the map x -> 2x - 1. But it can't guarantee that your sum stays stable. The reason can be seen in these observations:
np.sum(x)
0.9999999999999999
np.sum(2*x)
1.9999999999999998
np.sum(2*x-1)
-3.0
As you can see, the last sum doesn't decrease by 1 as expected. It actually decreases by 5, because each of the 5 items was decreased by 1: sum(2x - 1) = 2*sum(x) - 5 = 2 - 5 = -3.
Try y = 1/(dimension/3) - 2*x, i.e. y = 3/n - 2x for dimension n. That worked for me: the sum is then n*(3/n) - 2*sum(x) = 3 - 2 = 1.
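A quick sketch to check this; note, as a caveat, that each y_i then lies in [3/n - 2, 3/n], so for n != 3 individual values can fall outside [-1, 1] even though the sum is 1.

import numpy as np

n = 5
x = np.random.dirichlet(np.ones(n))
y = 1/(n/3) - 2*x   # i.e. y = 3/n - 2*x
# sum(y) = n*(3/n) - 2*sum(x) = 3 - 2 = 1
print(y, np.sum(y))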
As one of the answers on stats.stackexchange explains, there is only one way to map the variables of a Dirichlet distribution to [-1, 1] with a transform of this kind while keeping the sum equal to 1: the dimension must be 3, using y = 1 - 2x, since sum(1 - 2x) = n - 2*sum(x) = n - 2, which equals 1 exactly when n = 3.
import numpy
numpy.random.seed(seed = 1)
x = numpy.random.dirichlet(alpha = numpy.ones(3), size = 1)
y = 1-2*x
print(x, numpy.sum(x))
print(y, numpy.sum(y))
which prints:
[[2.97492728e-01 7.02444212e-01 6.30601451e-05]] 1.0000000000000002
[[0.40501454 -0.40488842 0.99987388]] 0.9999999999999998
I have a set of pairs of numpy arrays. Each array in a pair is the same length, but arrays in different pairs have different lengths. An example of a pair of arrays from this set is:
Time: [5,8,12,17,100,121,136,156,200]
Score: [3,4,5,-10,-90,-80,-70,-40,10]
Another pair is:
Time: [6,7,9,15,199]
Score: [5,6,7,-11,-130]
I need to take an average (or perform binning) over all of these pairs based on the time, i.e., the time should be divided into intervals of 10, and the corresponding score(s) in each interval need to be averaged.
Thus, for the above 2 pairs, I want the following result:
Time: [1-10,11-20,21-30,31-40,41-50,...,191-200]
Score: [(3+4+5+6+7)/5, (5-10-11)/3, ...]
How can I do this? Is there a simpler way than binning everything individually and then taking the average? How do you bin one array based on the bins of another? That is, for an individual pair of arrays, how can I bin the time array into intervals of 10 and then use this result to bin the corresponding score array in a consistent manner?
You can use scipy.stats.binned_statistic. This is a generalization of a histogram function. A histogram divides the space into bins, and returns the count of the number of points in each bin. This function allows the computation of the sum, mean, median, or other statistic of the values (or set of values) within each bin.
from scipy import stats
import numpy as np
T1 = [5,8,12,17,100,121,136,156,200]
S1 = [3,4,5,-10,-90,-80,-70,-40,10]
T2 = [6,7,9,15,199]
S2 = [5,6,7,-11,-130]
# Merging all Times and Scores in order
Time = T1 + T2
Score = S1 + S2
output = stats.binned_statistic(Time, Score, statistic='mean', range=(0, 200), bins=20)
averages = output[0]
# For empty bins it generates NaN, which we can replace with 0
print(np.nan_to_num(averages, nan=0))
# Output of this code:
# [ 5. -5.33333333 0. 0. 0.
# 0. 0. 0. 0. 0.
# -90. 0. -80. -70. 0.
# -40. 0. 0. 0. -60. ]
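The return value also answers the sub-question about binning one array based on another: binned_statistic returns, besides the statistic itself, the bin edges and the (1-based) bin index of every input point. A small sketch for a single pair:

# statistic: mean score per bin; bin_edges: [0. 10. 20. ... 200.];
# binnumber: for each element of T1, the bin it was assigned to - which
# is exactly the binning of S1 induced by T1.
statistic, bin_edges, binnumber = stats.binned_statistic(
    T1, S1, statistic='mean', range=(0, 200), bins=20)
print(bin_edges)
print(binnumber)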
For more information, see the scipy.stats.binned_statistic documentation.
I am trying to implement a custom distance metric for clustering. The code snippet looks like:
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift

def distance(x, y):
    # print(x, y) -> this x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(., .) counts the elements where both xi and yi are True
    return distance(x, y)

vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(vectorized_text)
The vectorized_text is a one-hot encoded feature matrix of size n_samples x n_features. But when custom_metric is called, one of x or y turns out to be a real-valued vector while the other remains a one-hot vector; expectedly, both x and y should be one-hot vectors. This causes custom_metric to return wrong results at run time, and hence the clustering is not correct.
Example of x and y in distance(x, y) method:
x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]
Both should have been one-hot vectors.
Does anyone have an idea to go about this situation?
First of all, your distance is wrong.
Distances must return small values for similar vectors. You have defined a similarity, not a distance.
Secondly, using naive Python code such as zip will perform extremely poorly. Python just does not optimize such code well; it does all the work in the slow interpreter. Python speed is only okay if you vectorize everything. In fact, this code can be vectorized trivially, and then it likely won't even matter whether your inputs are binary or float data. What you are computing in a very complicated fashion is nothing but the dot product of two vectors, isn't it?
Thus, your distance should probably look like this:
def distance(x, y):
    return x.shape[0] - np.dot(x, y)
Or whatever distance transformation you intend to use.
Now for your actual problem: my guess is that sklearn tries to accelerate your distance with a ball tree. That won't help much because of the poor performance of Python interpreter callbacks (in fact, you should probably precompute the entire distance matrix in one vectorized operation - something like dist = dim - X.transpose().dot(X)? Do the math yourself to figure out the equation). Other languages such as Java (e.g., the ELKI tool) are much easier to extend this way, because of the way the hotspot JIT compiler can optimize and inline such calls everywhere.
To test the hypothesis that the sklearn ball tree is the cause of the odd values you are observing, try setting algorithm='brute' (see the documentation) to disable the ball tree. But in the end, you'll want to either precompute the entire distance matrix (if you can afford the O(n²) cost) or switch to a different programming language (implementing your distance in Cython, for example, helps, but you will likely still see the data arriving as numpy float arrays).
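A minimal sketch of the precompute-everything route, assuming the rows of vectorized_text are the samples (so the pairwise dot products are X @ X.T rather than the X.transpose().dot(X) guessed above):

import numpy as np
from sklearn.cluster import DBSCAN

X = vectorized_text.astype(float)
# Mismatch distance: dimension minus the number of positions where both rows are 1.
D = X.shape[1] - X @ X.T
dbscan = DBSCAN(min_samples=2, eps=3, metric='precomputed').fit(D)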
I don't get your question. If I have:
x = [1, 0, 1]
y = [0, 0, 1]
and I use:
def distance(x, y):
    # print(x, y) -> this x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count
print(distance(x, y))
1.0
and on top of that, if you print x and y now:
x
[1, 0, 1]
y
[0, 0, 1]
so it is working?
I reproduced your code and I do get your error. Let me explain it better here:
He has a vectorized_text variable (an np.stack) which simulates a one-hot encoded feature set (it only contains 0s and 1s). In the DBSCAN model, he uses a custom_metric function to calculate the distance. It is expected that, when the model is run, the custom metric function receives pairs of observations exactly as they are: one-hot encoded values. Instead, when printing those values inside the distance function, only one of them arrives as-is, and the other appears to be a list of real values, as he described in the question:
x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]
Anyway, when I pass lists to the fit parameter, the function receives the values as they are:
from sklearn.cluster import KMeans, DBSCAN, MeanShift

x = [1, 0, 1]
y = [0, 0, 1]
feature_set = [x*5]*5

def distance(x, y):
    # Printing the values here. They should be 0s and 1s
    print(x, y)
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(., .) counts the elements where both xi and yi are True
    return distance(x, y)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(feature_set)
Result:
[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]
[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]
I suggest you use a pandas DataFrame or some other input type and see if it works.
There are a couple of ball-bounce related questions on Stack Overflow that I've looked through; however, none of them seem to get me past my predicament. I have a turtle cursor, defined by a transformation matrix, that intersects a line in 3D space. What I want is to rotate the cursor, that is, the transformation matrix, at the point of intersection so that its new direction matches the reflection vector. I have functions that will get both the reflection vector R from the incident vector V and the normal of the reflecting line N. I normalize each before evaluating:
N, V = unit_vector(N), unit_vector(V)
R = -2*(np.dot(V, N))*N - V
R = unit_vector(R)
My transformation matrix T is stored in a numpy array:
array([[ -0.84923515, -0.6 , 0. , 3.65341878],
[ 0.52801483, -0.84923515, 0. , 25.12882224],
[ 0. , 0. , 1. , 0. ],
[ 0. , 0. , 0. , 1. ]])
How can I transform T by R to get the correct direction vector? I've found and used the R2_vect function from here to get a rotation matrix from one vector to another, but only a few of the resulting reflections appear correct when I send them to vtk to render. I'm asking about this here because I seem to be reaching the limit of what I can remember from my already shaky linear algebra. Thanks for any information.
A little extra research clarified things: the first 3 columns of the transformation matrix represent 3 orthonormal vectors (x1, x2, x3), and the 4th column represents the coordinates in space of the cursor at the given time interval. The final row contains no data; it's just there to keep the matrix square. Rotating the vectors was just a matter of removing the last row of T, taking the 3x3 rotation matrix R from my listed function, and rotating each vector: R.dot(x1), R.dot(x2), R.dot(x3). Then I just had to composite the values back into a 4x4 matrix, as sketched below.
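A minimal sketch of that procedure, assuming R is the 3x3 rotation matrix and T is the 4x4 transformation matrix from the question; rotating the three column vectors individually is equivalent to multiplying the upper-left 3x3 block by R:

import numpy as np

def rotate_cursor(T, R):
    # Rotate the three orthonormal direction vectors stored in the first
    # three columns of T's upper-left 3x3 block; R.dot() applied to each
    # column is the same as the matrix product R @ T[:3, :3].
    T_new = T.copy()
    T_new[:3, :3] = R @ T[:3, :3]
    # The 4th column (cursor position) and the bottom row stay unchanged.
    return T_new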