I am working with two-dimensional data
X = array([[5.40310335, 0. ],
[6.86136114, 6.56225717],
[0. , 0. ],
...,
[5.88838732, 0. ],
[6.0003473 , 0. ],
[6.25971331, 0. ]])
Looking for clusters using Euclidean distance, I ran affinity propagation from scikit-learn on this raw data as follows
af = AffinityPropagation(damping=.9, max_iter=300, random_state=0).fit(X)
obtaining as a result 9 clusters.
I understand that when you want to use another distance, you have to pass the negative distance matrix and set affinity='precomputed', as follows
af_c = AffinityPropagation(damping=.9, max_iter=300,
affinity='precomputed', random_state=0).fit(distM)
If, as distM, I use the negative Euclidean distance matrix calculated as follows
distM = -distance_matrix(X, X)
np.fill_diagonal(distM, np.median(distM))
filling the diagonal with the median, since that is also the default preference value in the method.
Using this I am getting 34 clusters as a result and I would expect to have 9 as if working with the default distance. I don't know if I'm interpreting the way of entering the distance matrix correctly or if the library does something different when one uses everything predefined.
I would appreciate any help.
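A possible explanation, sketched under assumptions: the scikit-learn documentation describes the default 'euclidean' affinity as the negative squared Euclidean distance, not the plain negative distance, and a preference left unset defaults to the median of the input similarities. The snippet below uses a small synthetic dataset (X_demo is a stand-in made with make_blobs, not the data from the question) to check that a precomputed negative squared distance matrix gives the same labels as the default.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical stand-in data; replace with the real X.
X_demo, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Default affinity ('euclidean').
af_default = AffinityPropagation(damping=0.9, max_iter=300,
                                 random_state=0).fit(X_demo)

# Precomputed similarities: negative SQUARED Euclidean distances.
# The diagonal is left untouched and no preference is passed,
# so the estimator applies its own default preference.
S = -euclidean_distances(X_demo, squared=True)
af_pre = AffinityPropagation(damping=0.9, max_iter=300,
                             affinity='precomputed',
                             random_state=0).fit(S)

same_labels = np.array_equal(af_default.labels_, af_pre.labels_)
print(same_labels)
```

If this holds on the real data as well, the 34 clusters would come from using unsquared distances, which changes both the similarities and their median.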
Related
I am trying to find the cosine similarity of a list of strings. I used sklearn tfidf vector to convert the text into a numerical vector first and then used the pairwise cosine_similarity api to find the score for each string pair.
The strings seem similar, but I am getting a weird answer. The first and third value in the string array are similar except the word TRENTON, but the cosine similarity is 0. Similarly, the 1st,3rd and 4th string are the same, except for a space between GREEN and CHILLI and the cosine similarity is zero. Isn't that strange?
My code:
from sklearn.metrics import pairwise_kernels
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer()
values =['GREENCHILLI TRENTON'
,'GREENCHILLI'
,'GREEN CHILLI'
,'GREEN CHILLI']
X_train_counts = tfidf_vectorizer.fit_transform(values)
similarities = cosine_similarity(X_train_counts)
print(similarities)
Output
[[1. 0.6191303 0. 0. ]
[0.6191303 1. 0. 0. ]
[0. 0. 1. 1. ]
[0. 0. 1. 1. ]]
A comma (,) is missing between the last two 'GREEN CHILLI' strings, so TF-IDF treats them as only 3 records, not 4.
If you correct it, you should see the cosine similarity below:
[[1. 0.6191303 0. 0. ]
[0.6191303 1. 0. 0. ]
[0. 0. 1. 1. ]
[0. 0. 1. 1. ]]
How to interpret the above matrix: the values in the nth row are the cosine similarities of that TF-IDF vector with all the vectors (in sequential order). So every diagonal entry is 1, because every vector is identical to itself.
The first and third value in the string array values is similar except the word Trenton but cosine similarity is 0.
Similarly, 1st,3rd and 4th strings are same only space between GREEN and CHILLI and the cosine similarity is zero. isn't it strange?
It is not as strange as you might think. You will only get a non-zero cosine similarity if you have exact word matches between the strings that you compare. I will try to explain what happens:
When the TF-IDF vectorizer creates vectors from your list of strings, it starts by making a list of all words that occur.
So in your case, the list would look like this:
GREENCHILLI
TRENTON
GREEN
CHILLI
Now, every word becomes an axis in a coordinate system that the algorithm uses. All axes are perpendicular to each other.
So when you compare 'GREENCHILLI TRENTON' with 'GREEN CHILLI', the algorithm makes two vectors. The one from 'GREENCHILLI TRENTON' has a component parallel to 'GREENCHILLI' and a component parallel to 'TRENTON'. The vector from the string 'GREEN CHILLI' has components in the 'GREEN' and 'CHILLI' directions of your coordinate system. The two vectors share no axis, so their dot product is zero, and the cosine similarity is zero as well.
So the gap in 'GREEN CHILLI' makes all the difference when you compare it to 'GREENCHILLI'. The letters don't matter anymore once the vectorizer has made its coordinate system based on all the words it found in your list, because it identifies 'GREENCHILLI', 'GREEN' and 'CHILLI' as different words and makes them into perpendicular axes in its reference coordinate system.
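The explanation above can be checked with a short sketch. It reuses the strings from the question; the variable names (vectorizer, shared_axes, sims) are only illustrative. Note that TfidfVectorizer lowercases tokens by default, so the axes appear in lowercase.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

values = ['GREENCHILLI TRENTON', 'GREENCHILLI', 'GREEN CHILLI', 'GREEN CHILLI']

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(values)

# One axis per distinct word found in the list.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)

# Axes on which both string 1 and string 3 have a non-zero component.
dense = tfidf.toarray()
shared_axes = (dense[0] > 0) & (dense[2] > 0)
print(shared_axes.any())

# No shared axis means a zero dot product, hence zero cosine similarity.
sims = cosine_similarity(tfidf)
print(sims[0, 2])
```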
Hope that makes it clearer. I suggest reading the following article series for a more in-depth understanding of what's going on:
http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/
I have created a weighted k-neighbors graph using scikit-learn, and I'm wondering if there is any way to plot it as a graph.
Here is the result of computation in form of array which I want to plot:
array([[0. , 2.08243189, 0. , 3.42661108],
[2.08243189, 0. , 3.27141008, 0. ],
[0. , 3.27141008, 0. , 1.57294787],
[0. , 3.29779083, 1.57294787, 0. ]])
I just need to get some visualization of data, that's all I need.
More details about the array:
Each row represents a node and each column represents the weight of connectivity of that node with the other nodes.
For example: second column of first row (2.08243189) is the weight of connectivity from first node to second node.
Another example: second row, second column (0): the weight of connectivity from node 2 to itself.
The numbers represent Euclidean distances.
Are you talking about something simple like this where the size of the point gives a visual indication of the relative weight compared to the other values? Assume the array is named ar:
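import matplotlib.pyplot as plt

n = len(ar)
for i in range(n):
    for j in range(n):
        v = ar[i, j]
        # marker area grows as 10**v, so larger weights give larger points
        plt.scatter(i + 1, j + 1, lw=0, s=10**v)
plt.grid(True)
plt.xlabel('Row')
plt.ylabel('Column')
ticks = list(range(1, n + 1))
plt.xticks(ticks)
plt.yticks(ticks)
plt.show()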
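If a node-and-edge picture is wanted instead, here is a minimal sketch using only NumPy and matplotlib. It assumes that a zero entry means "no edge" and places the nodes on a circle; the helper names edge_list and draw_graph are hypothetical, not part of any library.

```python
import numpy as np
import matplotlib.pyplot as plt

ar = np.array([[0.        , 2.08243189, 0.        , 3.42661108],
               [2.08243189, 0.        , 3.27141008, 0.        ],
               [0.        , 3.27141008, 0.        , 1.57294787],
               [0.        , 3.29779083, 1.57294787, 0.        ]])

def edge_list(weights):
    """Return (source, target, weight) for every non-zero entry."""
    rows, cols = np.nonzero(weights)
    return [(int(i), int(j), float(weights[i, j])) for i, j in zip(rows, cols)]

def draw_graph(weights):
    """Draw nodes on a circle and one line per non-zero entry."""
    n = len(weights)
    angles = 2 * np.pi * np.arange(n) / n
    pos = np.column_stack([np.cos(angles), np.sin(angles)])
    fig, ax = plt.subplots()
    for i, j, w in edge_list(weights):
        ax.plot(pos[[i, j], 0], pos[[i, j], 1], color='gray', lw=1, zorder=1)
        # Put the label nearer the source node so i->j and j->i do not overlap.
        label_xy = 0.65 * pos[i] + 0.35 * pos[j]
        ax.text(label_xy[0], label_xy[1], f'{w:.2f}', fontsize=8)
    ax.scatter(pos[:, 0], pos[:, 1], s=300, zorder=2)
    for k in range(n):
        ax.text(pos[k, 0], pos[k, 1], str(k + 1), ha='center', va='center')
    ax.set_axis_off()
    return fig

edges = edge_list(ar)
fig = draw_graph(ar)
```

Call plt.show() (or fig.savefig with a file name of your choice) to view the result. The node labels are 1-based to match the row and column numbering used in the question.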
I read about using RPCA to find outliers on time series data. I have an idea about the fundamentals of what RPCA is about and the theory. I got a Python library that does RPCA and pretty much got two matrices as the output (L and S), a low rank approximation of the input data and a sparse matrix.
Input data (each row is a day, with 10 features as columns):
DAY 1 - 100,300,345,126,289,387,278,433,189,153
DAY 2 - 300,647,245,426,889,987,278,133,295,153
DAY 3 - 200,747,145,226,489,287,378,1033,295,453
Output obtained:
L
[[ 125.20560531 292.91525518 92.76132814 141.33797061 282.93586313
185.71134917 199.48789246 96.04089205 192.11501055 118.68811072]
[ 174.72737183 408.77013914 129.45061871 197.24046765 394.84366245
259.16456278 278.39005349 134.0273274 268.1010231 165.63205458]
[ 194.38951303 454.76920678 144.01774873 219.43601655 439.27557808
288.32845493 309.71739782 149.10947628 298.27053871 184.27069609]]
S
[[ -25.20560531 0. 252.23867186 -0. 0.
201.28865083 78.51210754 336.95910795 -0. 34.31188928]
[ 125.27262817 238.22986086 115.54938129 228.75953235 494.15633755
727.83543722 -0. -0. 26.8989769 -0. ]
[ 0. 292.23079322 -0. 0. 49.72442192
-0. 68.28260218 883.89052372 0. 268.72930391]]
Inference: (My question)
Now how do I infer which points could be classified as outliers? For example, by looking at the data, we could say 1033 looks like an outlier. The corresponding entry in the S matrix is 883.89052372, which is larger than the other entries in S. Could a fixed threshold on the deviations of the S matrix entries from the corresponding original values in the input matrix be used to determine that a point is an outlier? Or am I completely misunderstanding the concept of RPCA? TIA for your help.
You understood the concept of robust PCA (RPCA) correctly: The sparse matrix S contains the outliers.
However, S will often contain many observations (non-zero values) you might not classify as anomalies yourself. As you suggest it is therefore a good idea to filter out these points.
Applying a fixed threshold to identify relevant outliers could potentially work for one dataset. However, using the threshold on many datasets might give poor results if there are changes in mean and variance of the underlying distribution.
Ideally you calculate an anomaly score and then classify the outliers based on that score. A simple method (and often used in outlier detection) is to see if your data point (potential outlier) is at the tail of your assumed distribution.
For example, if you assume your distribution is Gaussian you can calculate the Z-score (z):
z = (x-μ)/σ,
where μ is the mean and σ is the standard deviation.
You can then apply a threshold to the calculated Z-score to identify an outlier. For example, if z > 3 for a given observation, the data point is classified as an outlier: the observation lies more than 3 standard deviations above the mean, in roughly the top 0.1% of a Gaussian distribution. This approach is more robust to changes in the data than a threshold on the non-standardized values. Furthermore, tuning the z value at which you classify an outlier is simpler than finding a raw-scale threshold (such as 883.89052372 in your case) for each dataset.
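As a rough illustration of the Z-score idea, the sketch below standardizes the S matrix from the question and flags entries with |z| > 3. Pooling all entries of S into one mean and one standard deviation is an assumption made here for brevity; with features on different scales, a per-column Z-score may be more appropriate.

```python
import numpy as np

# Sparse matrix S as printed in the question (rows are days, columns are features).
S = np.array([
    [-25.20560531, 0.0, 252.23867186, 0.0, 0.0,
     201.28865083, 78.51210754, 336.95910795, 0.0, 34.31188928],
    [125.27262817, 238.22986086, 115.54938129, 228.75953235, 494.15633755,
     727.83543722, 0.0, 0.0, 26.8989769, 0.0],
    [0.0, 292.23079322, 0.0, 0.0, 49.72442192,
     0.0, 68.28260218, 883.89052372, 0.0, 268.72930391],
])

# z = (x - mu) / sigma, computed over all entries of S.
z = (S - S.mean()) / S.std()

# (row, column) indices of entries more than 3 standard deviations from the mean.
outliers = np.argwhere(np.abs(z) > 3)
print(outliers)
```

On this small example only the entry for day 3, feature 8 (the one corresponding to 1033 in the input) crosses the threshold.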
I have computed a jaccard similarity matrix with Python. I want to cluster highest similarities to lowest, however, no matter what linkage function I use it produces the same dendrogram! I have a feeling that the function assumes that my matrix is of original data, but I have already computed the first similarity matrix. Is there any way to pass this similarity matrix through to the dendrogram so it plots correctly? Or am I going to have to output the matrix and simply do it with R. Passing through the original raw data is not possible, as I am computing similarities of words. Thanks for the help!
Here is some code:
SimMatrix = [[ 0.,0.09259259, 0.125 , 0. , 0.08571429],
[ 0.09259259, 0. , 0.05555556, 0. , 0.05128205],
[ 0.125 , 0.05555556, 0. , 0.03571429, 0.05882353],
[ 0. , 0. , 0.03571429, 0. , 0. ],
[ 0.08571429, 0.05128205, 0.05882353, 0. , 0. ]]
linkage = hcluster.complete(SimMatrix) #doesnt matter what linkage...
dendro = hcluster.dendrogram(linkage) #same plot for all types?
show()
If you run this code, you will see a dendrogram that is completely backwards. No matter what linkage type I use, it produces the same dendrogram. This intuitively can not be correct!
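Here's the solution. It turns out the SimMatrix first needs to be converted into a condensed matrix (the upper triangle of the matrix, without the diagonal, flattened into a vector), and the similarities need to be turned into distances (here, 1 - similarity).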
You can see this in the code below:
import scipy.spatial.distance as ssd
distVec = ssd.squareform(SimMatrix)
linkage = hcluster.linkage(1 - distVec)
dendro = hcluster.dendrogram(linkage)
show()
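For reference, here is a sketch of the same idea written against scipy.cluster.hierarchy, which is assumed here to play the role of the hcluster module used above. It uses no_plot=True so that it runs without a display; the names cond_sim, cond_dist, Z_complete and Z_single are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

sim = np.array([[0.        , 0.09259259, 0.125     , 0.        , 0.08571429],
                [0.09259259, 0.        , 0.05555556, 0.        , 0.05128205],
                [0.125     , 0.05555556, 0.        , 0.03571429, 0.05882353],
                [0.        , 0.        , 0.03571429, 0.        , 0.        ],
                [0.08571429, 0.05128205, 0.05882353, 0.        , 0.        ]])

# Square matrix -> condensed vector of the 10 pairwise similarities.
cond_sim = squareform(sim)

# Jaccard similarity -> distance, so that similar items merge first.
cond_dist = 1.0 - cond_sim

# A 1-D condensed input is treated as distances, not as raw observations.
Z_complete = linkage(cond_dist, method='complete')
Z_single = linkage(cond_dist, method='single')

# The merge heights now depend on the linkage method.
print(Z_complete[:, 2])
print(Z_single[:, 2])

# Leaf order of the dendrogram, computed without plotting.
leaves = dendrogram(Z_complete, no_plot=True)['ivl']
```

Passing the square matrix directly makes the function treat each row as an observation vector, which would explain why the original dendrogram looked wrong.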
There are a couple of ball-bounce related questions on Stack Overflow that I've looked through, but none of them seem to get me past my predicament. I have a turtle cursor defined by a transformation matrix that intersects a line in 3D space. What I want is to rotate the cursor, that is, the transformation matrix, at the point of intersection so that its new direction matches the reflection vector. I have functions that will get the reflection vector R from the incident vector V and the normal N of the reflecting line. I normalize each before evaluating:
N, V = unit_vector(N), unit_vector(V)
# standard reflection of an incident direction V about the unit normal N
R = V - 2 * np.dot(V, N) * N
R = unit_vector(R)
My transformation matrix, T, is a NumPy array:
array([[ -0.84923515, -0.6 , 0. , 3.65341878],
[ 0.52801483, -0.84923515, 0. , 25.12882224],
[ 0. , 0. , 1. , 0. ],
[ 0. , 0. , 0. , 1. ]])
How can I transform T by R to get the correct direction vector? I've found and used the R2_vect function from here to get a rotation matrix from one vector to another, but only a few of the resulting reflections appear correct when I send them to vtk to render. I'm asking about this here because I seem to be reaching the limit of what I can remember from my already shaky linear algebra. Thanks for any information.
A little extra research clarified things: the first 3 columns of the transformation matrix represent 3 orthonormal vectors (x1, x2, x3), and the 4th column represents the coordinates in space of the cursor at a given time interval. The final row contains no data of its own; it is the homogeneous-coordinate row [0, 0, 0, 1] that keeps the matrix square. Rotating the vectors was just a matter of removing the last row of T, taking the 3x3 rotation matrix returned by the R2_vect function mentioned above (call it R), and rotating each vector: R.dot(x1), R.dot(x2), R.dot(x3). Then I just had to composite the values back into a 4x4 matrix.
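The steps described above can be sketched as follows. This is not the R2_vect function from the question: rotation_between is a hypothetical stand-in built on Rodrigues' formula, reflect uses the standard formula R = V - 2(V.N)N for an incident direction V, and the example T uses an identity orientation with the position column from the question.

```python
import numpy as np

def unit_vector(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def reflect(V, N):
    """Reflect the incident direction V about the normal N."""
    V, N = unit_vector(V), unit_vector(N)
    return V - 2.0 * np.dot(V, N) * N

def rotation_between(a, b):
    """3x3 rotation taking direction a onto direction b (Rodrigues' formula)."""
    a, b = unit_vector(a), unit_vector(b)
    v = np.cross(a, b)
    c = np.dot(a, b)
    if np.isclose(c, -1.0):
        # Antiparallel vectors: the rotation axis is ambiguous, not handled here.
        raise ValueError('a and b are antiparallel')
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

def reorient(T, rot):
    """Rotate the three axis columns of a 4x4 transform, keeping its position."""
    T_new = np.array(T, dtype=float)
    T_new[:3, :3] = rot @ T_new[:3, :3]
    return T_new

# Example: a cursor heading diagonally down hits a surface with normal +y.
V = np.array([1.0, -1.0, 0.0])
N = np.array([0.0, 1.0, 0.0])
R_vec = reflect(V, N)
rot = rotation_between(V, R_vec)

T = np.eye(4)
T[:3, 3] = [3.65341878, 25.12882224, 0.0]
T_new = reorient(T, rot)
print(R_vec)
print(T_new)
```

Multiplying the 3x3 block as a whole is equivalent to rotating x1, x2 and x3 one by one, and the 4th column and last row are left untouched.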