I am trying to plot my data with colours that correspond to the clusters, as shown below:
However, when I write this code it shows as below:
model = KMeans(n_clusters = 2)
model.fit(projected_data)
labels = model.predict(projected_data)
plt.scatter(projected_data[0],projected_data[1],c='red')
plt.show()
Looking online, I found that changing c='red' to c=labels should fix the problem, but whenever I change the code to plt.scatter(projected_data[0],projected_data[1],c=labels) it gives me this error:
'c' argument has 2 elements, which is inconsistent with 'x' and 'y' with size 6.
How can I make the colours change dynamically (Not having to type an array of strings 6 times like c=['red','blue'...]) to get a colour for each cluster?
In case you need it to test it yourself, projected_data variable equals
[[ 4 4 -6 3 1 -5]
[ 0 -3 2 -1 5 -4]]
The c argument expects a sequence with one entry per plotted point. You can see an example here where we use kmeans.labels_ as the color parameter. You could change the color palette or map the cluster labels to strings of the color you want.
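from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
# six samples (rows) with two features (columns) each
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
x, y = X.T
print(kmeans.labels_)  # one label per row of X
plt.scatter(x, y, c=kmeans.labels_)
plt.show()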
Output
[1 1 1 0 0 0]
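Applied to the data in the question, a minimal sketch of the label-to-colour mapping could look like the following. The labels array here is hypothetical (it stands in for whatever KMeans returns), and the palette is an arbitrary choice. Note that projected_data has shape (2, 6), so scikit-learn sees 2 samples with 6 features; fitting on the transpose gives one label per point, which is what c needs.

```python
import numpy as np

projected_data = np.array([[4, 4, -6, 3, 1, -5],
                           [0, -3, 2, -1, 5, -4]])

# scikit-learn expects one sample per row, so the model should be fit on this
samples = projected_data.T              # shape (6, 2)

# Hypothetical labels, one per sample, standing in for KMeans output
labels = np.array([0, 0, 1, 0, 0, 1])

# Look up a colour name for every label in one indexing step
palette = np.array(["red", "blue"])     # assumed palette, one entry per cluster
colours = palette[labels]

print(samples.shape)
print(colours)
```

With real labels from KMeans(n_clusters=2).fit_predict(samples), either c=labels or c=colours can then be passed to plt.scatter(projected_data[0], projected_data[1], ...).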
For example I have these numpy arrays:
import pandas as pd
import numpy as np
# Points could be n-dimensional; I need a solution that covers that case
# and still lets me compute distances between points, so flattening the data
# is not my goal.
points = np.array([[1, 2], [2, 1], [100, 100], [-2, -1], [0, 0], [-1, -2]]) # a 2d numpy array containing points in space
labels = np.array([0, 1, 1, 1, 0, 0]) # the labels of the points (not necessarily only 0 and 1)
I tried to make a dictionary and create the pandas DataFrame from it:
my_dict = {'point': points, 'label': labels}
df = pd.DataFrame(my_dict, columns=['point', 'label'])
But it didn't work and I got the following exception:
Exception: Data must be 1-dimensional
Probably it's because of the numpy array of points (a 2d numpy array).
The desired result:
        point  label
0      [1, 2]      0
1      [2, 1]      1
2  [100, 100]      1
3    [-2, -1]      1
4      [0, 0]      0
5    [-1, -2]      0
Thanks in advance for all the helpers :)
You should always try to normalize your data such that each column only contains singular values, not data with a dimension.
In this case, I would do something like this:
>>> df = pd.DataFrame({'x': points[:,0], 'y': points[:, 1], 'label': labels},
columns=['x', 'y', 'label'])
>>> df
x y label
0 1 2 0
1 2 1 1
2 100 100 1
3 -2 -1 1
4 0 0 0
5 -1 -2 0
If you truly insist on keeping the points as they are, convert them to a list of lists or a list of tuples before passing them to pandas to avoid this error.
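A minimal sketch of that second option, assuming points.tolist() is an acceptable way to get the list of lists (list(map(tuple, points)) would give tuples instead):

```python
import numpy as np
import pandas as pd

points = np.array([[1, 2], [2, 1], [100, 100], [-2, -1], [0, 0], [-1, -2]])
labels = np.array([0, 1, 1, 1, 0, 0])

# A list of lists is 1-dimensional from pandas' point of view:
# each row of `points` becomes one Python list stored in an object column
df = pd.DataFrame({"point": points.tolist(), "label": labels})

print(df)
```

The point column then holds Python lists, so vectorised numeric operations on it are not available; np.array(df["point"].tolist()) recovers the 2D array when distances are needed.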
This code runs the k-means algorithm from the scikit-learn package:
from sklearn.cluster import KMeans
import numpy as np
from matplotlib import pyplot
X = np.array([[10, 2 , 9], [1, 4 , 3], [1, 0 , 3],
[4, 2 , 1], [4, 4 , 7], [4, 0 , 5], [4, 6 , 3],[4, 1 , 7],[5, 2 , 3],[6, 3 , 3],[7, 4 , 13]])
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
k = 3
kmeans.fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
for i in range(k):
    # select only data observations with cluster label == i
    ds = X[np.where(labels==i)]
    # plot the data observations
    pyplot.plot(ds[:,0],ds[:,1],'o')
    # plot the centroids
    lines = pyplot.plot(centroids[i,0],centroids[i,1],'kx')
    # make the centroid x's bigger
    pyplot.setp(lines,ms=15.0)
    pyplot.setp(lines,mew=2.0)
pyplot.show()
generates:
As I've not set the x and y axis labels, what do these axis values represent?
scikit-learn uses the Euclidean distance to compute the distance between points, so do the axis values represent Euclidean distances?
The doc http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html does not describe this scenario.
Update: it does appear to plot only the first two dimensions of the array, as using
X = np.array([[10, 2 , 90], [1, 4 , 35], [1, 0 , 30],
[4, 2 , 1], [4, 4 , 7], [4, 0 , 5], [4, 6 , 3],[4, 1 , 7],[5, 2 , 3],[6, 3 , 3],[7, 4 , 13]])
I've updated the third dimension of the first three points to 90, 35 and 30. This has no effect on the resulting plot. So in order to visualize more than two dimensions I should run PCA on the data.
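A minimal sketch of that PCA idea, assuming a plain NumPy SVD instead of a dedicated library class. The variable names are illustrative, and the projection is only one of several ways to reduce three dimensions to two:

```python
import numpy as np

X = np.array([[10, 2, 90], [1, 4, 35], [1, 0, 30],
              [4, 2, 1], [4, 4, 7], [4, 0, 5], [4, 6, 3],
              [4, 1, 7], [5, 2, 3], [6, 3, 3], [7, 4, 13]], dtype=float)

# Centre each column, then take the SVD of the centred data
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Rows of Vt are the principal directions, ordered by explained variance;
# projecting onto the first two gives 2D coordinates that use all 3 columns
X_2d = X_centred @ Vt[:2].T

print(X_2d.shape)
```

X_2d[:, 0] and X_2d[:, 1] can then be plotted in place of the first two raw columns.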
TL;DR
I think it's simply plotting your first variable on the "x", and your second variable on the "y".
(But "x" and "y" are the wrong terms.)
Detail
In machine learning, the terms x and y are usually used a bit differently. In your case your X matrix contains data points with 3 values:
The first two values are usually called x1 and x2 variables (x with 1 subscript, if I could format it that way).
And the third value is ... I'm not sure yet. I don't see it on the plot.
If you look at your original data in X, you see [10, 2, 9], [1, 4, 3], ...
The first two variables of the first data point are (10, 2).
You can see a point plotted at horizontal 10, vertical 2.
There is a second point plotted at horizontal 1, vertical 4.
And so on ...
So from that you can basically see that the horizontal axis is x1, and the vertical is x2.
I don't know how the third value appears on the plot. It's possible that it's the color, but usually in k-means, the color is used to separate the different values into clusters. So each color is a cluster.
So I don't really see where the third value is. But that wasn't your question! :)
You probably want the documentation for pyplot, not for scikit-learn. Here is pyplot: http://matplotlib.org/api/pyplot_api.html
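To make the column selection explicit, here is a small sketch of what the plotting loop passes to pyplot.plot for one cluster. The labels array is hypothetical and only the first four points of X are used:

```python
import numpy as np

X = np.array([[10, 2, 9], [1, 4, 3], [1, 0, 3], [4, 2, 1]])
labels = np.array([0, 1, 1, 1])   # hypothetical cluster labels

ds = X[labels == 1]       # rows assigned to cluster 1
horizontal = ds[:, 0]     # first column goes on the horizontal axis
vertical = ds[:, 1]       # second column goes on the vertical axis
unused = ds[:, 2]         # third column is never handed to the plot call

print(horizontal, vertical)
```

So the third value only influences which cluster a point ends up in, not where it is drawn.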
The following octave code shows a sample 3D matrix using Octave/Matlab
octave:1> A=zeros(3,3,3);
octave:2>
octave:2> A(:,:,1)= [[1 2 3];[4 5 6];[7 8 9]];
octave:3>
octave:3> A(:,:,2)= [[11 22 33];[44 55 66];[77 88 99]];
octave:4>
octave:4> A(:,:,3)= [[111 222 333];[444 555 666];[777 888 999]];
octave:5>
octave:5>
octave:5> A
A =
ans(:,:,1) =
1 2 3
4 5 6
7 8 9
ans(:,:,2) =
11 22 33
44 55 66
77 88 99
ans(:,:,3) =
111 222 333
444 555 666
777 888 999
octave:6> A(1,3,2)
ans = 33
I need to build the same matrix in numpy. Unfortunately, when I access the same index of the numpy array I get a different value, as shown below:
import numpy as np
array = np.array([[[1 ,2 ,3],[4 ,5 ,6],[7 ,8 ,9]], [[11 ,22 ,33],[44 ,55 ,66],[77 ,88 ,99]], [[111 ,222 ,333],[444 ,555 ,666],[777 ,888 ,999]]])
>>> array[0,2,1]
8
I also read the document that describes the differences between matrices in MATLAB and arrays in Python numpy (Numpy for Matlab users), but I didn't find a sample 3D array and how it maps to MATLAB and vice versa.
The answer is different: for example, accessing the element (1,3,2) in MATLAB doesn't match the same index in numpy (0,2,1).
Octave/Matlab
octave:6> A(1,3,2)
ans = 33
Python
>>> array[0,2,1]
8
The way your array is constructed in numpy is different than it is in MATLAB.
Where your MATLAB array is (y, x, z), your numpy array is (z, y, x). Your 3d numpy array is a series of 'stacked' 2d arrays, so you're indexing "outside->inside" (for lack of a better term). Here's your array definition expanded so this (hopefully) makes a little more sense:
[[[1, 2, 3],
[4, 5, 6], # Z = 0
[7 ,8 ,9]],
[[11 ,22 ,33],
[44 ,55 ,66], # Z = 1
[77 ,88 ,99]],
[[111 ,222 ,333],
[444 ,555 ,666], # Z = 2
[777 ,888 ,999]]
]
So with:
import numpy as np
A = np.array([[[1 ,2 ,3],[4 ,5 ,6],[7 ,8 ,9]], [[11 ,22 ,33],[44 ,55 ,66],[77 ,88 ,99]], [[111 ,222 ,333],[444 ,555 ,666],[777 ,888 ,999]]])
B = A[1, 0, 2]
B returns 33, as expected.
If you want a less mind-bending way of indexing your array, consider generating it as you did in MATLAB.
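One possible way to get the MATLAB-style (y, x, z) layout without rebuilding the array by hand is to move the first axis to the end. This is a sketch using np.moveaxis; the name A_matlab_like is made up for illustration:

```python
import numpy as np

A = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11, 22, 33], [44, 55, 66], [77, 88, 99]],
              [[111, 222, 333], [444, 555, 666], [777, 888, 999]]])

# Move the "slab" axis (axis 0) to the end: the layout becomes (row, column, slab)
A_matlab_like = np.moveaxis(A, 0, -1)

# MATLAB A(1,3,2) is one-based, so subtract 1 from each index for numpy
value = A_matlab_like[0, 2, 1]

print(value)
```

After this, an index triple taken from MATLAB only needs the one-based to zero-based shift.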
MATLAB and Python index differently. To investigate this, let's create a linear array of the numbers 1 to 8 and then reshape it into a 2-by-2-by-2 array in each language:
MATLAB:
M_flat = 1:8
M = reshape(M_flat, [2,2,2])
which returns
M =
ans(:,:,1) =
1 3
2 4
ans(:,:,2) =
5 7
6 8
Python:
import numpy as np
P_flat = np.array(range(1,9))
P = np.reshape(P_flat, [2,2,2])
which returns
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
The first thing you should notice is that the first two dimensions have switched. This is because MATLAB uses column-major ordering, which means it counts down the columns first, whereas Python uses row-major ordering and hence counts across the rows first.
Now let's try indexing them. So let's try slicing along the different dimensions. In MATLAB, I know to get a slice out of the third dimension I can do
M(:,:,1)
ans =
1 3
2 4
Now let's try the same in Python
P[:,:,0]
array([[1, 3],
[5, 7]])
So that's completely different. To get the MATLAB 'equivalent' we need to go
P[0,:,:]
array([[1, 2],
[3, 4]])
Now this returns the transpose of the MATLAB version, which is to be expected due to the row-major vs column-major difference.
So what does this mean for indexing? It looks like Python puts the major index at the end, which is the reverse of MATLAB.
Let's say I index as follows in MATLAB
M(1,2,2)
ans =
7
Now to get the 7 from Python we should use
P[1,1,0]
which is the MATLAB index order reversed (and zero-based). Note that it is reversed because we created the Python array with row-major ordering in mind. If you create it as you did in your code, you would have to swap the last two indices, so it is better to create the array correctly in the first place, as Ander has suggested in the comments.
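The index reversal can be checked with a short sketch. It relies on the fact that P.T reverses the axis order of a numpy array; the variable names are illustrative:

```python
import numpy as np

P = np.arange(1, 9).reshape(2, 2, 2)   # row-major, numpy's default

# One-based MATLAB indices of the element M(1,2,2)
i, j, k = 1, 2, 2

# Reverse the index order and shift to zero-based
from_reversed = P[k - 1, j - 1, i - 1]

# Equivalent: reverse the axes once, then index in MATLAB order
from_transposed = P.T[i - 1, j - 1, k - 1]

print(from_reversed, from_transposed)
```

Both lookups return the element that MATLAB reports for M(1,2,2).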
I think better than just calling the difference "row major" or "column major" is numpy's way of describing them:
‘C’ means to read / write the elements using C-like index order, with the last axis index changing fastest, back to the first axis index changing slowest. ‘F’ means to read / write the elements using Fortran-like index order, with the first index changing fastest, and the last index changing slowest.
Some gifs to illustrate the difference: The first is row-major (python / c), second is column-major (MATLAB/ Fortran)
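The two orders are easy to see on a small array. This sketch uses the order argument that numpy's ravel and reshape both accept:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

c_order = a.ravel(order="C")   # last axis changes fastest: walk along rows
f_order = a.ravel(order="F")   # first axis changes fastest: walk down columns

# Reshaping with order="F" fills the array the way MATLAB's reshape does
m_like = np.arange(1, 9).reshape((2, 2, 2), order="F")

print(c_order)
print(f_order)
print(m_like[:, :, 0])
```

With order="F", m_like[:, :, 0] has rows [1, 3] and [2, 4], the same slice MATLAB shows for M(:,:,1).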
I think the problem is the way you create the array in numpy, together with the different representations used by MATLAB and numpy. Why not build the array in numpy the same way you did in MATLAB?
>>> A = np.zeros((3,3,3),dtype=int)
>>> A
array([[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]],
[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]],
[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]]])
>>> A[:,:,0] = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> A[:,:,1] = np.array([[11,22,33],[44,55,66],[77,88,99]])
>>> A[:,:,2] = np.array([[111,222,333],[444,555,666],[777,888,999]])
>>> A
array([[[ 1, 11, 111],
[ 2, 22, 222],
[ 3, 33, 333]],
[[ 4, 44, 444],
[ 5, 55, 555],
[ 6, 66, 666]],
[[ 7, 77, 777],
[ 8, 88, 888],
[ 9, 99, 999]]])
>>> A[0,2,1]
33
I think that numpy indexes arrays in the way shown in the following figure:
https://www.google.com.eg/search?q=python+indexing+arrays+numpy&biw=1555&bih=805&source=lnms&tbm=isch&sa=X&ved=0ahUKEwia7b2J1qzOAhUFPBQKHXtdCBkQ_AUIBygC#imgrc=7JQu1w_4TCaAnM%3A
Also, there are several ways to store your data: you can choose order='F' to count down the columns first, as MATLAB does, while the default is order='C', which counts across the rows first.
In the following example the cross-correlation of the A,B arrays is calculated using the cv2.matchTemplate method. The result is stored in the C array:
import cv2
import numpy as np
A=np.ones((3,3), dtype=np.uint8)
B=np.array([[1,2,3],[4,5,6],[7,8,9]], dtype=np.uint8)
C=cv2.matchTemplate( A, B, cv2.TM_CCORR )
>>> A
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1]], dtype=uint8)
>>> B
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]], dtype=uint8)
>>> C
array([[ 45.]], dtype=float32)
Let's implement the same example using scipy:
import cv2
import numpy as np
import scipy
import scipy.signal
A = np.ones((3,3), dtype=np.uint8)
B = np.array([[1,2,3],[4,5,6],[7,8,9]], dtype=np.uint8)
C = scipy.signal.correlate2d(A,B)
>>> C
array([[ 9, 17, 24, 15, 7],
[15, 28, 39, 24, 11],
[18, 33, 45, 27, 12],
[ 9, 16, 21, 12, 5],
[ 3, 5, 6, 3, 1]], dtype=uint8)
Let's now implement the same example using Octave:
octave:4> A=ones(3,3)
A =
1 1 1
1 1 1
1 1 1
octave:5> B=[1 2 3; 4 5 6; 7 8 9]
B =
1 2 3
4 5 6
7 8 9
octave:6> C=xcorr2(A,B)
C =
9 17 24 15 7
15 28 39 24 11
18 33 45 27 12
9 16 21 12 5
3 5 6 3 1
By comparing the results we can see that OpenCV's method generates a significantly different result.
Could someone explain the difference between the various implementations of the 2D cross-correlation?
What should I change in my OpenCV code in order to compute the 2D cross-correlation properly?
Thank you all,
funk
Well, to begin we need to refer to the OpenCV documentation:
Matlab/OpenCV
cv2.matchTemplate(image, templ, method[, result]) → result
result – Map of comparison results. It must be single-channel 32-bit floating-point. If image is W x H and templ is w x h , then result is (W-w+1) x (H-h+1).
With a 3x3 image and a 3x3 template, your result will be a (3-3+1)x(3-3+1) = (1x1) matrix, which is what the method actually did return.
The formula used by the TM_CCORR method is as follows:
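For reference, the OpenCV documentation defines TM_CCORR as the plain sum of products below, where I is the image, T the template, R the result, and the sum runs over every template position (x', y'):

```latex
R(x, y) = \sum_{x', y'} T(x', y') \cdot I(x + x', y + y')
```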
Now let's look at the difference between this and the other implementations.
SciPy
scipy.signal.correlate2d(in1, in2, mode='full', boundary='fill', fillvalue=0)[source]
The result size is determined by the mode parameter. Using the default parameter of full means that the result size will be (W+w-1) x (H+h-1). However, changing the mode to valid will result in a (W-w+1) x (H-h+1) result, which is the same as that achieved by OpenCV.
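A short sketch of the two modes on the arrays from the question. Plain integer arrays are used here instead of uint8, which is an assumption made only to keep overflow out of the picture:

```python
import numpy as np
from scipy import signal

A = np.ones((3, 3), dtype=int)
B = np.arange(1, 10).reshape(3, 3)

# Default mode: every position with at least one overlapping cell
full = signal.correlate2d(A, B, mode="full")

# Only positions where the second array fits entirely inside the first
valid = signal.correlate2d(A, B, mode="valid")

print(full.shape, valid.shape)
print(valid)
```

The valid result is the single value 45, which is also the centre cell of the full result.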
Octave
C = xcorr2(A,B)
The size of the result matrix is:
C_rows = A_rows + B_rows - 1
C_cols = A_cols + B_cols - 1
With a 3x3 image and a 3x3 template, your result will be a (3+3-1)x(3+3-1)=(5x5) matrix.
The formula used by this method appears different than that used by OpenCV, but is actually just a different form of the same equation.
Conclusions
The formulas used in all three implementations appear to be the same. The reason for the difference between the methods is the way that boundary conditions are handled. Cross-correlation is achieved by "sliding" the template matrix over the image matrix and setting the result sum for a given cell to the sum of the products of the overlapping cells in the image and template. However, for the edge cases in the image, unless the template is a 1x1 matrix, it will overlap the edge of the image (see the picture below for an example). This case can be handled by padding or wrapping the image. In the first case, the image is enlarged and padded with zeros to ensure that the template cannot overhang the image.
In both SciPy and Octave, the default method is to pad the image, which will generate an image that is larger than the input image (indeed, in the case of two 3x3 matrices, the result is 5x5 because the template overhangs the image by a total of 2 rows and 2 columns when centered on the edge cells of the image). In OpenCV, the default method is to drop the edge cases where the template hangs over the image, which in this instance means that the only valid position for the template is centered exactly over the center of the image. This explains the single result cell with a value of 45: the sum of all elements of the template, each multiplied by 1.
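The sliding-and-summing described above can be sketched in plain NumPy. The helper cross_correlate and its pad parameter are hypothetical names written for this illustration; they are not part of OpenCV, SciPy or Octave:

```python
import numpy as np

def cross_correlate(image, templ, pad=0):
    """Slide templ over image (zero-padded by `pad` cells per side), summing products."""
    image = np.pad(image.astype(np.int64), pad, mode="constant")
    th, tw = templ.shape
    out_h = image.shape[0] - th + 1
    out_w = image.shape[1] - tw + 1
    out = np.zeros((out_h, out_w), dtype=np.int64)
    for r in range(out_h):
        for c in range(out_w):
            # Sum of products of the template and the window it currently covers
            out[r, c] = np.sum(image[r:r + th, c:c + tw] * templ)
    return out

A = np.ones((3, 3), dtype=np.uint8)
B = np.arange(1, 10, dtype=np.uint8).reshape(3, 3)

no_padding = cross_correlate(A, B)         # template must fit inside the image
padded = cross_correlate(A, B, pad=2)      # pad by template size minus one

print(no_padding)
print(padded)
```

Without padding there is only one position and the result is the single value 45. Padding by two cells per side reproduces the 5x5 matrix that SciPy and Octave return.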
To answer your question of how to get the same results using OpenCV: enlarge the input matrix so that its size is
(W+2(w-1)) x (H+2(h-1)), center the image in the new matrix and pad the area outside of the image with 0's. The result then has size (W+w-1) x (H+h-1), the same as the SciPy and Octave defaults. For a 3x3 template this means 2 rows and 2 columns of zeros on each side:
A = np.pad(np.ones((3,3), dtype=np.uint8), 2, mode='constant')