Encoding with OneHotEncoder - python

I'm trying to preprocess data with scikit-learn's OneHotEncoder. Obviously, I'm doing something wrong. Here is my sample program:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
cat = ['ok', 'ko', 'maybe', 'maybe']
label_encoder = LabelEncoder()
label_encoder.fit(cat)
cat = label_encoder.transform(cat)
# returns [2 0 1 1], which seems good.
print(cat)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
res = ct.fit_transform([cat])
print(res)
Final result: [[1.0 0 1 1]]
Expected result: something like:
[
[ 1 0 0 ]
[ 0 0 1 ]
[ 0 1 0 ]
[ 0 1 0 ]
]
Can someone point out what I'm missing?

You can consider using NumPy and MultiLabelBinarizer:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
cat = np.array([['ok', 'ko', 'maybe', 'maybe']])
m = MultiLabelBinarizer()
print(m.fit_transform(cat.T))
If you still want to stick with your solution, you just need to update it as follows:
# cat is still a row, not a column
# res = ct.fit_transform([cat]) => remove this
# this should work
res = ct.fit_transform(np.array([cat]).T)
Out[2]:
array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 1., 0.]])
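As a side note, recent scikit-learn versions (0.20 and later) let OneHotEncoder consume string categories directly, so the LabelEncoder step is not needed at all. A minimal sketch, assuming the data is reshaped into a single column first:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cat = np.array(['ok', 'ko', 'maybe', 'maybe']).reshape(-1, 1)  # one sample per row, one feature column
enc = OneHotEncoder()
print(enc.fit_transform(cat).toarray())  # dense 4x3 one-hot matrix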

Related

How to transform this data for logistic regression?

I have 'y' and 'X' data:
y = [1, 0, 0, 0, 0, 0, 0, 0 ...] (this part is fine for my purpose)
and
X = [['reg' '03b' '03e' 'buy']
['reg' '03b' '04e' 'sell']
['pref' '02b' '03e' 'sell']
['cur' '03b' '03e' 'buy']
['val' '03b' '03e' 'buy']
['reg' '03b' '03e' 'buy'] ...]
X[0] may take the values 'reg'/'pref'/'cur'/'val'
X[1]: string with the number of the month + 'b' (= begin) at the end
X[2]: string with the number of the month + 'e' (= end) at the end
X[3]: 'buy' or 'sell'
But I can't do
logreg = LogisticRegression()
logreg.fit(X,y)
because I have trouble with the structure of X (it is a list of lists of strings).
I want to fix it and do:
logreg = preprocessing.LabelEncoder()
i=0
while i<len(X):
    logreg.fit(X[i])
    b[i] = logreg.transform(X[i])
    i = i + 1
But I get this:
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
...
[3 0 1 2]
All elements are the same. How can I correctly transform my data for .fit(X,y)?
The problem is that you are mixing up rows and columns in X.
import numpy as np
from sklearn import preprocessing
X = [['reg', '03b', '03e', 'buy'],
['reg', '03b', '04e', 'sell'],
['pref', '02b', '03e', 'sell'],
['cur', '03b', '03e', 'buy'],
['val', '03b', '03e', 'buy'],
['reg', '03b', '03e', 'buy']]
X = np.array(X)
b = np.zeros(X.shape)
logreg = preprocessing.LabelEncoder()
i = 0
while i < X.shape[1]:
    logreg.fit(X[:, i])
    b[:, i] = logreg.transform(X[:, i])
    i += 1
b
array([[2., 1., 0., 0.],
[2., 1., 1., 1.],
[1., 0., 0., 1.],
[0., 1., 0., 0.],
[3., 1., 0., 0.],
[2., 1., 0., 0.]])
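As a follow-up: LabelEncoder assigns arbitrary integer ranks, which LogisticRegression will then treat as ordered numeric values. For nominal features like these, one-hot encoding the whole matrix before fitting is usually the safer choice. A sketch under that assumption (the y values here are hypothetical, matching the six rows above):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()                 # handles string categories directly in scikit-learn >= 0.20
X_encoded = enc.fit_transform(X)      # sparse matrix with one 0/1 column per category

y = [1, 0, 0, 0, 0, 1]                # hypothetical labels for the six example rows
logreg = LogisticRegression()
logreg.fit(X_encoded, y)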

How to efficiently filter maximum elements of a matrix per row

Given a 2D array, I'm looking for a pythonic way to get an array of same shape, with only the maximum element per each row.
See max_row_filter function below
import numpy as np

def max_row_filter(mat2d):
    m = np.zeros(mat2d.shape)
    for r in range(mat2d.shape[0]):
        c = np.argmax(mat2d[r])
        m[r, c] = mat2d[r, c]
    return m
p = np.array([[1,2,3],[5,4,3,],[9,10,3]])
max_row_filter(p)
Out: array([[ 0., 0., 3.],
[ 5., 0., 0.],
[ 0., 10., 0.]])
I'm looking for an efficient way to do this, suitable to be done on big arrays.
Alternative answer (this will keep duplicates):
p * (p==p.max(axis=1, keepdims=True))
If there are no duplicates, you could use numpy.argmax:
import numpy as np
p = np.array([[1, 2, 3],
[5, 4, 3, ],
[9, 10, 3]])
result = np.zeros_like(p)
rows, cols = zip(*enumerate(np.argmax(p, axis=1)))
result[rows, cols] = p[rows, cols]
print(result)
Output
[[ 0 0 3]
[ 5 0 0]
[ 0 10 0]]
Note that, in case of multiple occurrences of the maximum, argmax returns the first occurrence.
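If you also want a fully vectorized version that keeps only the first maximum per row, np.put_along_axis (NumPy 1.15+) can replace the zip/enumerate step; treat this as a sketch:
import numpy as np

p = np.array([[1, 2, 3], [5, 4, 3], [9, 10, 3]])
result = np.zeros_like(p)
idx = np.argmax(p, axis=1)[:, None]                     # column index of each row maximum
np.put_along_axis(result, idx, np.take_along_axis(p, idx, axis=1), axis=1)
print(result)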

Converting Python Dictionary to 3D Matlab Matrix

I have the following dictionary results_dict in Python 3.2 where the key field is a string value and the value field is a list of 3 arrays. Each array has 400 float values. I want to convert this dictionary into a data structure that can be used in Matlab 2017b. However, if I execute the following:
savemat('GridCellResults.mat', results_dict, oned_as='row');
The command executes successfully but Matlab is not able to understand the matrix file. For this reason, I wrote the following code to convert the previous dictionary into a 3 Dimensional Matrix (X,Y,Z) where X is the size of the array (400 Elements) and Y is the number of arrays for each dictionary key (3 Arrays) and Z is the number of elements in the dictionary. However, when I execute the code below I get the following error:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
Here is the code. Any clue why I am getting this error? Also, even if I try without the transpose function, I keep getting the same error.
import numpy as np
CARDINALITY = 400 # Number of angular domain values.
NUM_COLUMNS = 3
NUM_CELLS = 114
matlab_array = np.zeros((CARDINALITY, NUM_COLUMNS, NUM_CELLS))
for key, value in results_dict.items():
    matlab_array[:, 0, key] = np.transpose(value[0])
    matlab_array[:, 1, key] = np.transpose(value[1])
    matlab_array[:, 2, key] = np.transpose(value[2])
Trying to follow your description, I can successfully write and read such a dictionary
In an ipython session:
In [48]: from scipy.io import savemat, loadmat
In [49]: adict = {'a':[np.arange(3),np.ones(3),np.array([4,2,1])]}
In [50]: adict['b'] = [np.arange(3),np.ones(3),np.array([4,2,1])]
In [51]: adict
Out[51]:
{'a': [array([0, 1, 2]), array([1., 1., 1.]), array([4, 2, 1])],
'b': [array([0, 1, 2]), array([1., 1., 1.]), array([4, 2, 1])]}
In [52]: pwd
Out[52]: '/home/paul/mypy'
In [53]: savemat('stack48385062.mat',adict, oned_as='row')
In [54]: data = loadmat('stack48385062.mat')
In [55]: data
Out[55]:
{'__globals__': [],
'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Mon Jan 22 09:15:31 2018',
'__version__': '1.0',
'a': array([[0., 1., 2.],
[1., 1., 1.],
[4., 2., 1.]]),
'b': array([[0., 1., 2.],
[1., 1., 1.],
[4., 2., 1.]])}
The lists of arrays (of constant size) were converted to 2d arrays.
In an Octave session:
>> load stack48385062.mat
>> a
a =
0 1 2
1 1 1
4 2 1
>> b
b =
0 1 2
1 1 1
4 2 1
>>
Or creating your 3d array (using a numeric index rather than string key):
In [56]: M=np.zeros([3, 3, 2])
In [57]: for i in range(len(adict)):
...: for j in range(3):
...: v = adict[list(adict.keys())[i]]
...: M[:, j, i] = v[j]
...:
In [58]: M
Out[58]:
array([[[0., 0.],
[1., 1.],
[4., 4.]],
[[1., 1.],
[1., 1.],
[2., 2.]],
[[2., 2.],
[1., 1.],
[1., 1.]]])
>> load stack48385062_1.mat
>> M
M =
ans(:,:,1) =
0 1 4
1 1 2
2 1 1
ans(:,:,2) =
0 1 4
1 1 2
2 1 1
I should have made the initial dictionary with lists of three 4-element arrays, so it would be easier to track transpositions. MATLAB and numpy have different axis orders, which can be confusing. savemat tries to compensate.
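Applied to the original results_dict, the IndexError comes from indexing the third axis with the string key; enumerating the items gives you an integer position instead. A sketch, assuming every value really is a list of three 400-element arrays (the output variable name 'grid_cells' is just an example):
import numpy as np
from scipy.io import savemat

CARDINALITY = 400   # values per array
NUM_COLUMNS = 3     # arrays per dictionary entry
matlab_array = np.zeros((CARDINALITY, NUM_COLUMNS, len(results_dict)))

for z, (key, value) in enumerate(results_dict.items()):   # z is an integer index, not the string key
    for j in range(NUM_COLUMNS):
        matlab_array[:, j, z] = np.asarray(value[j]).ravel()

savemat('GridCellResults.mat', {'grid_cells': matlab_array}, oned_as='row')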

Save one-hot-encoded features into Pandas DataFrame the fastest way

I have a Pandas DataFrame with all my features and labels. One of my feature is categorical and needs to be one-hot-encoded.
The feature is an integer and can only have values from 0 to 4
To save those arrays back in my DataFrame I use the following code
# enc is my OneHotEncoder object
df['mycol'] = df['mycol'].map(lambda x: enc.transform(x).toarray())
My DataFrame has more than 1 million rows, so the above code takes a while. Is there a faster way to assign the arrays to the DataFrame cells? Because I have just 5 categories, I don't need to call the transform() function 1 million times.
I already tried something like
num_categories = 5
i = 0
while (i < num_categories):
    df.loc[df['mycol'] == i, 'mycol'] = enc.transform(i).toarray()
    i += 1
Which yields this error
ValueError: Must have equal len keys and value when setting with an ndarray
You can use pd.get_dummies:
>>> s
0 a
1 b
2 c
3 a
dtype: object
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
Alternatively:
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> a = np.array([1, 1, 3, 2, 2]).reshape(-1, 1)
>>> a
array([[1],
[1],
[3],
[2],
[2]])
>>> one_hot = enc.fit_transform(a)
>>> one_hot.toarray()
array([[ 1., 0., 0.],
[ 1., 0., 0.],
[ 0., 0., 1.],
[ 0., 1., 0.],
[ 0., 1., 0.]])
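To write the encoded values back into the DataFrame without calling transform() per row, you can expand the category into separate indicator columns and join them in one vectorized step; a sketch, assuming the column is named 'mycol':
import pandas as pd

dummies = pd.get_dummies(df['mycol'], prefix='mycol')   # one 0/1 column per category
df = df.drop('mycol', axis=1).join(dummies)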

Clustering uni-variate Time series using sklearn

I have a pandas DataFrame for which I would like to do clustering on each column. I am using sklearn and this is what I have:
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv("data.csv")
data = pd.DataFrame(data)
data = data.set_index("Time")
#print(data)
cluster_numbers=2
list_of_cluster=[]
for k, v in data.iteritems():
    temp = KMeans(n_clusters=cluster_numbers)
    temp.fit(data[k])
    print(k)
    print("predicted", temp.predict(data[k]))
    list_of_cluster.append(temp.predict(data[k]))
When I try to run it, I get this error: ValueError: n_samples=1 should be >= n_clusters=2
I am wondering what the problem is, as I have more samples than clusters. Any help will be appreciated.
The K-Means clusterer expects a 2D array where each row is a data point (which may itself be one-dimensional). In your case you have to reshape each pandas column into a matrix with len(data) rows and 1 column. See below an example that works:
from sklearn.cluster import KMeans
import pandas as pd
data = {'one': [1., 2., 3., 4., 3., 2., 1.], 'two': [4., 3., 2., 1., 2., 3., 4.]}
data = pd.DataFrame(data)
n_clusters = 2
for col in data.columns:
    kmeans = KMeans(n_clusters=n_clusters)
    X = data[col].values.reshape(-1, 1)
    kmeans.fit(X)
    print("{}: {}".format(col, kmeans.predict(X)))
