A kind of k-means clustering - Python

I tried to run this code in Python 2.7 with a 20x20 matrix, and I want to get two clusters, just like the k-means algorithm.
import numpy as np

filename = np.genfromtxt('Matrix.txt')
M = np.sort(np.random.choice(2, 20))
##m = np.copy(M)  => I get an error there: 'module' object is not callable
M = m  # this option works better, but I am not sure it is appropriate
# initialization of the clusters
C = {}
tmax = 100
for t in xrange(tmax):
    # determination of clusters
    J = np.mean(filename[:, M], axis=1)
    for k in range(2):
        C[k] = np.where(J == k, 0, 0)  # np.where(J == k) => another error for 'np.where': it takes exactly three arguments but one given. I saw that it could take only one argument
    # update
    for k in range(2):
        J = np.mean(filename[np.ix_(C[k], C[k])], axis=1)
        j = np.argmin(J)
        m[k] = C[k][j]  # [j] => another error for '[j]': invalid index to scalar variable
# results
print M, C
My result:
{0: 0, 1: 0}
The expected result:
{0: 8, 1: 12}
meaning, for example, that there are 8 elements in cluster '0' and 12 in cluster '1'.
This is probably because of the 'np.where' function, but I am not sure.
I ran the program without all the errors that I previously mentioned to get this result, but it doesn't work as well as it should.
Thanks for your help
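Two things stand out. With a single argument, np.where(J == k) returns a tuple of index arrays, so you need [0] to get the indices themselves; with three arguments, np.where(J == k, 0, 0) just returns an array of zeros, which is why every count comes out 0. Also, each point should be assigned to its nearest medoid with argmin rather than a plain mean. Here is a minimal sketch of the corrected loop, assuming Matrix.txt holds a 20x20 distance matrix (an assumption; the file isn't shown), and without a guard against empty clusters:
import numpy as np

D = np.genfromtxt('Matrix.txt')  # assumed: a 20x20 distance matrix
m = np.sort(np.random.choice(20, 2, replace=False))  # two distinct initial medoid indices
tmax = 100
for t in range(tmax):
    # assign every point to its nearest medoid
    J = np.argmin(D[:, m], axis=1)
    C = {k: np.where(J == k)[0] for k in range(2)}  # [0] unpacks the index tuple
    # move each medoid to the point with the smallest mean distance in its cluster
    for k in range(2):
        j = np.argmin(np.mean(D[np.ix_(C[k], C[k])], axis=1))
        m[k] = C[k][j]
counts = {k: len(C[k]) for k in range(2)}
print(counts)  # e.g. {0: 8, 1: 12}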

Another variant (it uses the scikit-learn library):
import numpy as np
from sklearn import cluster

filename = np.genfromtxt('Matrix.txt')  # load the matrix from the question
n_clusters = 2
k_means = cluster.KMeans(n_clusters=n_clusters)
k_means.fit(filename)
values = k_means.cluster_centers_
labels = k_means.labels_
print values
print labels
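To report the cluster sizes in the {label: count} form the question expects, a small follow-up (np.bincount counts the occurrences of each label):
import numpy as np

sizes = dict(enumerate(np.bincount(labels)))
print(sizes)  # e.g. {0: 8, 1: 12}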


Python ML Backward Elimination deleting the wrong arrays

I tried to build the optimal model using Backward Elimination while practicing a machine learning course, and I think I've done something wrong in my function definition; please help me find out what the problem is.
The function should delete the columns whose P-values are greater than 0.05.
I tried my idea on the console step by step, and when I tried to turn that idea into a function definition, something went wrong.
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Automatic Backward Elimination
def backwardelimination(x, sl):
    regressor_OLS = sm.OLS(y, x).fit()
    for i in range(0, len(x[0])):
        maxp = max(regressor_OLS.pvalues)  # find the max P-value
        if maxp > sl:  # delete the max P-value which is greater than SL
            x = np.delete(x, maxp, axis=1)
            regressor_OLS = sm.OLS(y, x).fit()
    return x

X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
SL = 0.05  # significance level
X_Modeled = backwardelimination(X_opt, SL)
Below are the original matrix and the result I expect, but somehow I get something else.
X_opt = X[:, [0, 1, 2, 3, 4, 5]]   # the original matrix
X_opt = X[:, [0, 3]]               # what it should be (after deleting the P-values greater than 0.05)
You need to pass the index of the max p-value to np.delete(), not the p-value itself. For that, first convert the p-values to a list; then it's really easy.
def backwardelimination(x, sl):
    regressor_OLS = sm.OLS(y, x).fit()
    for i in range(0, len(x[0])):
        c = list(regressor_OLS.pvalues)         # <-- list conversion
        j = c.index(max(c))                     # <-- getting the index of the max value
        maxp = max(regressor_OLS.pvalues).astype(float)
        if maxp > sl:
            x = np.delete(x, j, axis=1)         # <-- pass j to delete as an index
            regressor_OLS = sm.OLS(y, x).fit()
    print(regressor_OLS.summary())
    return x
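Equivalently, np.argmax gives that index directly, without the list conversion. A minimal sketch under the same assumptions (y and X come from the course dataset; statsmodels.api is used here, which is where OLS lives):
import numpy as np
import statsmodels.api as sm

def backward_elimination(x, y, sl=0.05):
    # repeatedly drop the column with the largest p-value, while it exceeds sl
    while True:
        regressor_OLS = sm.OLS(y, x).fit()
        j = int(np.argmax(regressor_OLS.pvalues))  # index of the max p-value
        if regressor_OLS.pvalues[j] <= sl:
            return x
        x = np.delete(x, j, axis=1)

# usage sketch: X_modeled = backward_elimination(X[:, [0, 1, 2, 3, 4, 5]], y)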

Python3 - TypeError: 'numpy.float64' object is not iterable

I want to make a plot that shows the misclassification error versus the number of neighbors K using KNN.
This is the code I've built for that:
from pyod.models.knn import KNN                      # PyOD's KNN (see EDIT below)
from sklearn.model_selection import cross_val_score

# creating odd list of K for KNN
myList = list(range(1, 50))
# subsetting just the odd ones
neighbors = filter(lambda x: x % 2 != 0, myList)
# empty list that will hold cv scores
cv_scores = []
# perform 10-fold cross validation
for k in neighbors:
    knn = KNN(n_neighbors=k, n_jobs=6, metric='minkowski', contamination=0.05)
    scores = cross_val_score(knn, X_test, pred, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

### Create Plot
import matplotlib.pyplot as plt
plt.style.use('ggplot')
# changing to misclassification error
MSE = [1 - x for x in cv_scores]
# determining best k
optimal_k = neighbors[MSE.index(min(next(iter(MSE))))]
print("The optimal K neighbors number is %d" % optimal_k)
# plot misclassification error vs k
plt.plot(neighbors, MSE, figsize=(20, 12))
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
The problem is at this line:
optimal_k = neighbors[MSE.index(min(next(iter(MSE))))]
This code seems to have been written in Python 2. This was the original line:
optimal_k = neighbors[MSE.index(min(MSE))]
I added next() and iter() to try to solve this issue, as advised by some users in other threads similar to this one. But I'm getting this error:
TypeError: 'numpy.float64' object is not iterable
I know why this error is happening: it should be iterating through a list, but it's getting single numbers instead. I think the problem comes from the use of filter() at this line:
neighbors = filter(lambda x: x % 2 != 0, myList)
How can I fix this code to run on Python 3 and prevent this from happening?
Thanks in advance
EDIT:
The KNN version I'm using is not the one in sklearn, for those who would like to try this code. It comes from an anomaly detection package called PyOD. Link here
You can also try it with the original KNN from sklearn, but note that you will need to remove the contamination parameter, and maybe the distance parameter.
The issue is that in Python 3, filter() returns a lazy iterator, so the code defines neighbors as an iterator that is exhausted by the first loop (and it cannot be indexed, either). Solution: use a list.
neighbors = list(filter(lambda x: x % 2 != 0, myList))
Also, your original syntax for getting the optimal k was correct (no need for iter or next):
optimal_k = neighbors[MSE.index(min(MSE))]
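Putting the two fixes together, a minimal Python 3 sketch of the affected parts. One extra observation, not raised above: plt.plot() does not accept a figsize argument; that belongs to plt.figure():
import matplotlib.pyplot as plt

neighbors = list(filter(lambda x: x % 2 != 0, range(1, 50)))  # a real list now
MSE = [1 - x for x in cv_scores]      # cv_scores filled by the loop in the question
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal K neighbors number is %d" % optimal_k)

plt.figure(figsize=(20, 12))          # figsize goes here, not in plt.plot()
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()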

Normally distributed sub-sampling from a numpy array in Python

I have a numpy array whose values are distributed in the following manner.
From this array I need to get a random sub-sample which is normally distributed.
I need to get rid of the values from the array which are above the red line in the picture, i.e. I need to remove some occurrences of certain values so that the distribution is smoothed once the abrupt peaks are removed.
My array's distribution should then become like this:
Can this be achieved in Python, without manually looking for the entries corresponding to the peaks and removing some of their occurrences? Can this be done in a simpler way?
The following kind of works; it is rather aggressive, though:
It works by ordering the samples, transforming them to uniform and then trying to select a regular grid-like subsample. If you feel it is too aggressive, you can increase ns, which is essentially the number of samples kept.
Also, please note that it requires knowledge of the true distribution. In the case of a normal distribution you should be fine using the sample mean and the unbiased variance estimate (the one with n-1).
Code (without plotting):
import scipy.stats as ss
import numpy as np

a = ss.norm.rvs(size=1000)
b = ss.uniform.rvs(size=1000) < 0.4
a[b] += 0.1 * np.sin(10 * a[b])  # perturb some samples to create peaks

def smooth(a, gran=25):
    o = np.argsort(a)
    s = ss.norm.cdf(a[o])  # map sorted samples to [0, 1] via the true CDF
    ns = int(gran / np.max(s[gran:] - s[:-gran]))  # number of samples to keep
    grid, dp = np.linspace(0, 1, ns, endpoint=False, retstep=True)
    grid += dp / 2
    idx = np.searchsorted(s, grid)  # nearest sample to each grid point
    # resolve collisions so the kept indices are strictly increasing
    c = np.flatnonzero(idx[1:] <= idx[:-1])
    while c.size > 0:
        idx[c + 1] = idx[c] + 1
        c = np.flatnonzero(idx[1:] <= idx[:-1])
    idx = idx[:np.searchsorted(idx, len(a))]
    return o[idx]

ap = a[smooth(a)]
c, b = np.histogram(a, 40)
cp, _ = np.histogram(ap, b)
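For a quick visual comparison of the two histograms, a small matplotlib sketch (not part of the original answer):
import matplotlib.pyplot as plt

centers = (b[:-1] + b[1:]) / 2           # bin centers from np.histogram's edges
plt.plot(centers, c, label='original')
plt.plot(centers, cp, label='subsampled')
plt.legend()
plt.show()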

Eigenvectors in Python giving seemingly random element-wise signs

I'm running the following code:
import numpy as np
import matplotlib
matplotlib.use("TkAgg")
import matplotlib.pyplot as plt
N = 100
t = 1
a1 = np.full((N-1,), -t)
a2 = np.full((N,), 2*t)
Hamiltonian = np.diag(a1, -1) + np.diag(a2) + np.diag(a1, 1)
eval, evec = np.linalg.eig(Hamiltonian)
idx = eval.argsort()[::-1]
eval, evec = eval[idx], evec[:,idx]
wave2 = evec[2] / np.sum(abs(evec[2]))
prob2 = evec[2]**2 / np.sum(evec[2]**2)
_ = plt.plot(wave2)
_ = plt.plot(prob2)
plt.show()
And the plot that comes out is this:
But I'd expect the blue line to be a sinusoid as well. This has got me confused, and I can't find what's causing the sudden sign changes. Plotting the absolute value of the function shows that the values associated with each x are fine, but the signs are screwed up.
Any ideas on what might cause this or how to solve it?
Here's a modified version of your script that does what you expected. The changes are:
Corrected the indexing for the eigenvectors; they are the columns of evec.
Used np.linalg.eigh instead of np.linalg.eig. This isn't strictly necessary, but since the matrix is symmetric you might as well use the more efficient code.
Did not reverse the order of the sorted eigenvalues; I keep the eigenvalues sorted from lowest to highest. Because eigh returns the eigenvalues in ascending order, I just commented out the code that sorts them.
(Only the first change is a required correction.)
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
N = 100
t = 1
a1 = np.full((N-1,), -t)
a2 = np.full((N,), 2*t)
Hamiltonian = np.diag(a1, -1) + np.diag(a2) + np.diag(a1, 1)
eval, evec = np.linalg.eigh(Hamiltonian)
#idx = eval.argsort()[::-1]
#eval, evec = eval[idx], evec[:,idx]
k = 2
wave2 = evec[:, k] / np.sum(abs(evec[:, k]))
prob2 = evec[:, k]**2 / np.sum(evec[:, k]**2)
_ = plt.plot(wave2)
_ = plt.plot(prob2)
plt.show()
The plot:
I may be wrong, but aren't they all valid eigenvectors/values? The sign shouldn't matter, as the definition of an eigenvector is:
In linear algebra, an eigenvector or characteristic vector of a linear transformation is a non-zero vector that only changes by an overall scale when that linear transformation is applied to it.
Just because the scale is negative doesn't mean it isn't valid.
See this post about Matlab's eig, which has a similar problem.
One way to fix this is to simply pick a sign for the start, and multiply everything by -1 that doesn't fit that sign (or take the absolute value of every element and multiply by your expected sign), as sketched below. For your results this should work (nothing crosses 0).
Neither Matlab nor numpy cares about what you are trying to solve; it's simple mathematics that dictates that both signed eigenvector/value combinations are valid. Your values are sinusoidal; it's just that there exist two sets of eigenvectors/values that work (negative and positive).
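A minimal sketch of that sign normalization, under an assumed convention (flip each eigenvector so that its largest-magnitude component is positive):
import numpy as np

def fix_signs(evec):
    # For each column, find its largest-magnitude entry and flip the
    # whole column so that this entry becomes positive.
    i = np.argmax(np.abs(evec), axis=0)
    signs = np.sign(evec[i, np.arange(evec.shape[1])])
    return evec * signs

# usage sketch: evec = fix_signs(evec) after np.linalg.eigh(Hamiltonian)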

numpy and pytables issue (error: tuple index out of range)

I am new to Python and PyTables. Currently I am writing a project about clustering and the KNN algorithm. This is what I have got.
********** code *****************
import numpy.random as npr
import numpy as np
import tables   # needed for Filters/openFile below
import ctypes   # needed inside knn

# step 0: obtain the cluster
dtype = np.dtype('f4')
pnts_inds = np.arange(100)
npr.shuffle(pnts_inds)
pnts_inds = pnts_inds[:10]
pnts_inds = np.sort(pnts_inds)
for i, ind in enumerate(pnts_inds):
    clusters[i] = pnts_obj[ind]

# step 1: save the result to an HDF5 file called clst_fn.h5
filters = tables.Filters(complevel=1, complib='zlib')
clst_fobj = tables.openFile('clst_fn.h5', 'w')
clst_obj = clst_fobj.createCArray(clst_fobj.root, 'clusters',
                                  tables.Atom.from_dtype(dtype), clusters.shape,
                                  filters=filters)
clst_obj[:] = clusters
clst_fobj.close()

# step 2: other function
# blabla

# step 3: load the cluster from clst_fn
pnts_fobj = tables.openFile('clst_fn.h5', 'r')
for pnts in pnts_fobj.walkNodes('/', classname='Array'):
    break
#
# step 4: evoke another function (called knn). The function's input argument is
# the data from pnts. I have checked the knn function individually; it works
# well if the input is pnts = npr.rand(100, 128)
def knn(pnts):
    pnts = np.ascontiguousarray(pnts)
    N = ctypes.c_uint(pnts.shape[0])
    D = ctypes.c_uint(pnts.shape[1])
    #
# evoke knn using the cluster from clst_fn (see step 3)
knn(pnts)
********** end of code *****************
My problem now is that Python is giving me a hard time by showing:
IndexError: tuple index out of range
This error comes from this line:
D = ctypes.c_uint(pnts.shape[1])
Obviously, there must be something wrong with the input argument. Any thoughts on fixing the problem? Thank you in advance.
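One likely cause, though it is an inference from the error rather than something confirmed in the question: the Array node yielded by walkNodes holds one-dimensional data, so pnts.shape has no second entry. A hedged sketch of a guard before calling knn:
import numpy as np

pnts = pnts.read()          # materialize the PyTables Array as a numpy array
pnts = np.atleast_2d(pnts)  # guarantee a second axis so pnts.shape[1] exists
print(pnts.shape)           # expect something like (10, 128)
knn(pnts)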
