I have a pandas DataFrame for which I would like to do clustering on each column. I am using sklearn and this is what I have:
data= pd.read_csv("data.csv")
data=pd.DataFrame(data)
data=data.set_index("Time")
#print(data)
cluster_numbers=2
list_of_cluster=[]
for k, v in data.iteritems():
    temp = KMeans(n_clusters=cluster_numbers)
    temp.fit(data[k])
    print(k)
    print("predicted", temp.predict(data[k]))
    list_of_cluster.append(temp.predict(data[k]))
When I try to run it, I get this error: ValueError: n_samples=1 should be >= n_clusters=2
I am wondering what the problem is, since I have more samples than clusters. Any help will be appreciated.
The K-Means clusterer expects a 2D array where each row is a data point (a data point may itself be one-dimensional). In your case you have to reshape the pandas column into a matrix with len(data) rows and 1 column. See below for an example that works:
from sklearn.cluster import KMeans
import pandas as pd
data = {'one': [1., 2., 3., 4., 3., 2., 1.], 'two': [4., 3., 2., 1., 2., 3., 4.]}
data = pd.DataFrame(data)
n_clusters = 2
for col in data.columns:
    kmeans = KMeans(n_clusters=n_clusters)
    # reshape the column into a (len(data), 1) matrix
    X = data[col].values.reshape(-1, 1)
    kmeans.fit(X)
    print("{}: {}".format(col, kmeans.predict(X)))
This is a follow-up to my previous question.
Given an NxM matrix A, I want to efficiently obtain the NxN matrix whose ith row is the sum along the 2nd axis of the result of applying np.minimum between A and the ith row of A.
Using a for loop,
> A = np.array([[1, 2], [3, 4], [5, 6]])
> output = np.zeros(shape=(A.shape[0], A.shape[0]))
> for i in range(A.shape[0]):
      output[i] = np.sum(np.minimum(A, A[i]), axis=1)
> output
array([[ 3.,  3.,  3.],
       [ 3.,  7.,  7.],
       [ 3.,  7., 11.]])
Is it possible to optimize this further without the for loop?
Edit: I would also like to do it without allocating an MxMxN tensor because of memory constraints.
You can use broadcasting instead of a for loop. Using the NumPy minimum and sum functions, you can compute the desired matrix output as follows:
output = np.sum(np.minimum(A[:, None], A), axis=2)
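As a quick check on the example array from the question, the broadcasted expression reproduces the loop's result. Note that np.minimum(A[:, None], A) still materialises the large three-dimensional intermediate that the edit wants to avoid, so it trades memory for speed:
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
output = np.sum(np.minimum(A[:, None], A), axis=2)
print(output)
# [[ 3  3  3]
#  [ 3  7  7]
#  [ 3  7 11]]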
I am working with a dataset of mixed categorical and numeric variables. There is lots of missing data and, as such, I am hoping to do some imputation through classifiers. I am currently using fast_knn from impyute.imputation.cs. fast_knn is an easy-to-use function that fills in missing values with a kNN model.
My hope is to pass a numpy array into fast_knn that contains one hot encodings for the categorical variables, with np.nan in place for the values that are missing, mixed with the data from numeric attributes (also with np.nan in place for values that are missing).
The difficulty is making sure the missing values are apparent after converting categorical data to one hot encodings. How can I convert categorical data to one hot encodings such that missing values result in np.nan (as opposed to a one hot encoding)? I have been struggling with this for some time embarrassingly — I was under the impression that OneHotEncoder from scikit places 0-filled arrays for missing values, but I don't believe this is correct.
I would like to use a throwaway example. Suppose I had a dataset with three features, two categorical and one numeric. Here is an example of the final structure I would like. The first two features are categorical and the third is numeric:
#np.nan is in place for any missing value.
[
[[0, 0], [0, 1], [1, 0], [1, 1], np.nan],
[[0, 0, 0], [0, 0, 1], [1, 0, 1], [1, 1, 1], np.nan], #Suppose this category has 8 possible values the attribute can take on.
[1, 3, np.nan, 3, 5]
]
fast_knn would impute wherever there is np.nan.
I hope my question is clear. Keep in mind that the categorical subset is quite large — 145000 rows x 5 columns. It would be good to not do something computationally expensive. I am hoping for a technique besides designating missing values as another kind of value a categorical attribute can take on and then iterating through the one hot encodings to change it back to np.nan.
1. One Hot Encoder (np.nan for unknown values not supported)
If you want to go with the one-hot encoding approach, OneHotEncoder does indeed set a zero array for unknown values; consider for example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
s = pd.Series(list('abca'))
enc.fit(s.values.reshape(-1, 1))
t = enc.transform(np.array(['a', 'c', 'Other', 'b', 'Another']).reshape(-1, 1))
t
>>>
array([[1., 0., 0.],
[0., 0., 1.],
[0., 0., 0.],
[0., 1., 0.],
[0., 0., 0.]])
The unknown categories Other and Another are zero arrays. To replace all zero arrays in t with np.nan:
zero_cond = (t == 0).all(axis=1)
t[zero_cond] = np.nan
t
>>>
array([[ 1., 0., 0.],
[ 0., 0., 1.],
[nan, nan, nan],
[ 0., 1., 0.],
[nan, nan, nan]])
which you can now pass to the imputer.
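A minimal sketch of that last step, assuming the numeric column is stacked next to the encoded ones and using fast_knn exactly as imported in the question (the illustrative data and the k value are assumptions, not part of the original answer):
import numpy as np
from impyute.imputation.cs import fast_knn

# numeric column with its own missing values (illustrative data)
numeric = np.array([1., 3., np.nan, 3., 5.]).reshape(-1, 1)

# t is the one-hot matrix from above, with its all-zero rows already replaced by np.nan
X = np.hstack([t, numeric])
X_imputed = fast_knn(X, k=3)  # fills every np.nan with a kNN estimate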
2. Ordinal Encoder (unknown values can be set to np.nan)
Another option for your categorical variables that sets NaN for unknown values is OrdinalEncoder with handle_unknown='use_encoded_value' (available since scikit-learn 0.24):
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)
enc.fit(s.values.reshape(-1, 1))
enc.transform(np.array(['a', 'c', 'Other', 'b', 'Another']).reshape(-1, 1))
>>>
array([[ 0.],
[ 2.],
[nan],
[ 1.],
[nan]])
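If you go this route, the imputed codes come back as floats, so one option (a sketch, not part of the original answer) is to round them and map them back to the category labels with inverse_transform:
# illustrative: ordinal codes as they might come back from the imputer
codes = np.array([[0.], [2.], [1.7], [1.], [0.4]])
labels = enc.inverse_transform(np.round(codes))
# 1.7 rounds to 2. -> 'c', and 0.4 rounds to 0. -> 'a'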
I have the following question. Is there some kind of method in numpy or scipy which I can use to take a given unsorted array like this
a = np.array([0,0,1,1,4,4,4,4,5,1891,7]) #could be any number here
and map the numbers so that there are no gaps between the values and they stay in the same order as before?
[0,0,1,1,2,2,2,2,3,5,4]
EDIT
Is it furthermore possible to swap/shuffle the numbers after the mapping, so that
[0,0,1,1,2,2,2,2,3,5,4]
become something like:
[0,0,3,3,5,5,5,5,4,1,2]
Edit: I'm not sure what the etiquette is here (should this be a separate answer?), but this is actually directly obtainable from np.unique.
>>> u, indices = np.unique(a, return_inverse=True)
>>> indices
array([0, 0, 1, 1, 2, 2, 2, 2, 3, 5, 4])
Original answer: This isn't too hard to do in plain python by building a dictionary of what index each value of the array would map to:
x = np.sort(np.unique(a))
index_dict = {j: i for i, j in enumerate(x)}
[index_dict[i] for i in a]
Seems you need to rank (dense) your array, in which case use scipy.stats.rankdata:
from scipy.stats import rankdata
rankdata(a, 'dense')-1
# array([ 0., 0., 1., 1., 2., 2., 2., 2., 3., 5., 4.])
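The shuffle asked for in the edit can be layered on top of the np.unique approach above: permute the dense labels and index the permutation with the inverse indices. A sketch; the particular permutation below is just the one that reproduces the example output, and in general you could use np.random.permutation(len(u)):
import numpy as np

a = np.array([0, 0, 1, 1, 4, 4, 4, 4, 5, 1891, 7])
u, indices = np.unique(a, return_inverse=True)
perm = np.array([0, 3, 5, 4, 2, 1])   # any permutation of range(len(u))
perm[indices]
# array([0, 0, 3, 3, 5, 5, 5, 5, 4, 1, 2])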
I have a 2D array, and it has some duplicate columns. I would like to be able to see which unique columns there are, and where the duplicates are.
My own array is too large to put here, but here is an example:
a = np.array([[ 1., 0., 0., 0., 0.],[ 2., 0., 4., 3., 0.],])
This has the unique column vectors [1.,2.], [0.,0.], [0.,4.] and [0.,3.]. There is one duplicate: [0.,0.] appears twice.
Now I found a way to get the unique vectors and their indices here, but it is not clear to me how I would get the occurrences of the duplicates as well. I have tried several naive ways (with np.where and list comps) but those are all very, very slow. Surely there has to be a numpythonic way?
In MATLAB it's just the unique function, but np.unique flattens arrays.
Here's a vectorized approach to give us a list of arrays as output -
# encode each column as a single integer (works here because the entries are non-negative integers)
ids = np.ravel_multi_index(a.astype(int), a.max(1).astype(int) + 1)
# group identical column ids together
sidx = ids.argsort()
sorted_ids = ids[sidx]
# split the sorted column indices wherever the id changes
out = np.split(sidx, np.nonzero(sorted_ids[1:] > sorted_ids[:-1])[0] + 1)
Sample run -
In [62]: a
Out[62]:
array([[ 1., 0., 0., 0., 0.],
[ 2., 0., 4., 3., 0.]])
In [63]: out
Out[63]: [array([1, 4]), array([3]), array([2]), array([0])]
The numpy_indexed package (disclaimer: I am its author) contains efficient functionality for computing these kinds of things:
import numpy_indexed as npi
unique_columns = npi.unique(a, axis=1)
non_unique_column_idx = npi.multiplicity(a, axis=1) > 1
Or alternatively:
unique_columns, column_count = npi.count(a, axis=1)
duplicate_columns = unique_columns[:, column_count > 1]
For small arrays:
from collections import defaultdict
indices = defaultdict(list)
for index, column in enumerate(a.transpose()):
    indices[tuple(column)].append(index)
unique = [kk for kk, vv in indices.items() if len(vv) == 1]
non_unique = {kk:vv for kk, vv in indices.items() if len(vv) != 1}
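Run on the example array from the question, the dictionary ends up looking like this (a quick illustration; only the (0., 0.) column has more than one index):
import numpy as np
from collections import defaultdict

a = np.array([[1., 0., 0., 0., 0.],
              [2., 0., 4., 3., 0.]])
indices = defaultdict(list)
for index, column in enumerate(a.transpose()):
    indices[tuple(column)].append(index)
print(dict(indices))
# {(1.0, 2.0): [0], (0.0, 0.0): [1, 4], (0.0, 4.0): [2], (0.0, 3.0): [3]}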
This seems like it should be straightforward, but I can't figure it out.
Data source is a two column, comma delimited input file with these contents:
6,10
5,9
8,13
...
And my code is:
import numpy as np
data = np.loadtxt("data.txt", delimiter=",")
m = len(data)
x = np.reshape(data[:,0], (m,1))
y = np.ones((m,1))
z = np.matrix([x,y])
Which gives me this error:
Users/acpigeon/.virtualenvs/ipynb/lib/python2.7/site-packages/numpy-1.9.0.dev_297f54b-py2.7-macosx-10.9-intel.egg/numpy/matrixlib/defmatrix.pyc in __new__(subtype, data, dtype, copy)
270 shape = arr.shape
271 if (ndim > 2):
--> 272 raise ValueError("matrix must be 2-dimensional")
273 elif ndim == 0:
274 shape = (1, 1)
ValueError: matrix must be 2-dimensional
No amount of reshaping seems to get this to work, so I'm either missing something really simple or there's a better way to do this.
EDIT:
It would have been helpful to specify the output I am looking for. Here is a line of code that generates the desired result:
In [1]: np.matrix([[5,1],[6,1],[8,1]])
Out[1]:
matrix([[5, 1],
[6, 1],
[8, 1]])
The desired output can be generated this way:
In [12]: np.array((data[:, 0], np.ones(m))).transpose()
Out[12]:
array([[ 6., 1.],
[ 5., 1.],
[ 8., 1.]])
The above is copied from ipython and so has ipython style prompts.
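If you prefer not to build the intermediate tuple, np.column_stack gives the same two-column result (a minor variation, not from the original answer):
z = np.column_stack((data[:, 0], np.ones(m)))
# array([[ 6.,  1.],
#        [ 5.,  1.],
#        [ 8.,  1.]])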
Answer to previous version
To eliminate the error, replace:
x = np.reshape(data[:, 0], (m, 1))
with:
x = data[:, 0]
The former line produces a 2-dimensional array of shape (m, 1), so np.matrix([x, y]) tries to build a 3-dimensional object, which is what raises the error message. The latter produces a 1-D array with the same data.
Or how about first turning the array into a matrix, and then changing the last column to 1?
In [2]: data=np.loadtxt('stack23859379.txt',delimiter=',')
In [3]: np.matrix(data)
Out[3]:
matrix([[ 6., 10.],
[ 5., 9.],
[ 8., 13.]])
In [4]: z = np.matrix(data)
In [5]: z[:,1]=1
In [6]: z
Out[6]:
matrix([[ 6., 1.],
[ 5., 1.],
[ 8., 1.]])