I would like to raise a vector to ascending powers from 0 to 4:
import numpy as np
a = np.array([1, 2, 3]) # list of 11 components
b = np.array([0, 1, 2, 3, 4]) # power
c = np.power(a,b)
desired results are:
c = [[1**0, 1**1, 1**2, 1**3, 1**4], [2**0, 2**1, ...], ...]
I keep getting this error:
ValueError: operands could not be broadcast together with shapes (3,) (5,)
One solution is to add a new dimension to your array a:
c = a[:,None]**b
# Using broadcasting:
# (3,1)**(5,) --> (3,5)
#
#     [[1],
# c =  [2], ** [0,1,2,3,4]
#      [3]]
For more information check the numpy broadcasting documentation
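To make the shape arithmetic concrete, here is a minimal sketch that runs the broadcast on the arrays from the question and checks the result shape. The comparison with `np.power.outer` is my own addition, not part of the answer above; it computes the same table of every base raised to every exponent.

```python
import numpy as np

a = np.array([1, 2, 3])   # bases, shape (3,)
b = np.arange(5)          # exponents 0..4, shape (5,)

# a[:, None] has shape (3, 1); broadcasting it against (5,) gives (3, 5).
c = a[:, None] ** b

# ufunc.outer applies np.power to every (base, exponent) pair.
c_outer = np.power.outer(a, b)

print(c.shape)
print(c)
```

Both forms avoid building a repeated copy of `a` first, which is the main difference from the tile/repeat based answers.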
Here's a solution:
num_of_powers = 5
num_of_components = 11
a = []
for i in range(1, num_of_components + 1):
    a.append(np.repeat(i, num_of_powers))
b = list(range(num_of_powers))
c = np.power(a,b)
The output c would look like:
array([[ 1, 1, 1, 1, 1],
[ 1, 2, 4, 8, 16],
[ 1, 3, 9, 27, 81],
[ 1, 4, 16, 64, 256],
[ 1, 5, 25, 125, 625],
[ 1, 6, 36, 216, 1296],
[ 1, 7, 49, 343, 2401],
[ 1, 8, 64, 512, 4096],
[ 1, 9, 81, 729, 6561],
[ 1, 10, 100, 1000, 10000],
[ 1, 11, 121, 1331, 14641]], dtype=int32)
Your solution shows a broadcast error because as per the documentation:
If x1.shape != x2.shape, they must be broadcastable to a common shape (which becomes the shape of the output).
c = [[x**y for y in b] for x in a]
c = np.asarray(list(map(lambda x: np.power(a,x), b))).transpose()
You need to first create a matrix where the rows are repetitions of each number. This can be done with np.tile:
mat = np.tile(a, (len(b), 1)).transpose()
And then raise that to the power of b elementwise:
np.power(mat, b)
All together:
import numpy as np
nums = np.array([1, 2, 3]) # list of 11 components
powers = np.array([0, 1, 2, 3, 4]) # power
print(np.power(np.tile(nums, (len(powers), 1)).transpose(), powers))
Which will give:
[[ 1 1 1 1 1] # == [1**0, 1**1, 1**2, 1**3, 1**4]
[ 1 2 4 8 16] # == [2**0, 2**1, 2**2, 2**3, 2**4]
[ 1 3 9 27 81]] # == [3**0, 3**1, 3**2, 3**3, 3**4]
I don't know if the title is appropriate or not, but let me show you what I want to do:
In [56]: import numpy as np
In [57]: a= np.random.rand(2,2,2); a
Out[57]:
array([[[0.4300565 , 0.82251319],
[0.56113378, 0.83284255]],
[[0.00822414, 0.28256243],
[0.16648411, 0.33381438]]])
In [58]: b=np.random.rand(2); b
Out[58]: array([0.8035224 , 0.09884653])
In [59]: np.stack([np.column_stack((b, a[:, i, :])) for i in range(a.shape[1])])
Out[59]:
array([[[0.8035224 , 0.4300565 , 0.82251319],
[0.09884653, 0.00822414, 0.28256243]],
[[0.8035224 , 0.56113378, 0.83284255],
[0.09884653, 0.16648411, 0.33381438]]])
So I want to stack an array as columns along an inner axis. Can the loop be replaced with something more efficient and concise in NumPy? I tried np.insert but could not make it work.
EDIT:
another example
In [110]: a= np.random.rand(5,3,3); a
Out[110]:
array([[[0.27506756, 0.82334411, 0.7004287 ],
[0.6834928 , 0.28457133, 0.6275462 ],
[0.49744358, 0.25131814, 0.56422852]],
[[0.82591597, 0.92367306, 0.04652992],
[0.98545051, 0.92813944, 0.14360307],
[0.85454081, 0.8254149 , 0.5637401 ]],
[[0.59545519, 0.41563571, 0.41937218],
[0.90980491, 0.30169504, 0.96630809],
[0.06713389, 0.64357544, 0.12901734]],
[[0.47566444, 0.33476802, 0.26635363],
[0.4678913 , 0.53028241, 0.03112231],
[0.68445959, 0.07113376, 0.86651669]],
[[0.66951982, 0.01827502, 0.43831829],
[0.02798567, 0.36880876, 0.55029074],
[0.40127051, 0.6311474 , 0.51015882]]])
In [111]: b= np.random.rand(5,2); b
Out[111]:
array([[0.01659589, 0.15320541],
[0.79025065, 0.28041334],
[0.56024173, 0.49317082],
[0.28229119, 0.46010724],
[0.72239851, 0.62075004]])
In [112]: np.stack([np.column_stack((b, a[:, i, :])) for i in range(a.shape[1])])
Out[112]:
array([[[0.01659589, 0.15320541, 0.27506756, 0.82334411, 0.7004287 ],
[0.79025065, 0.28041334, 0.82591597, 0.92367306, 0.04652992],
[0.56024173, 0.49317082, 0.59545519, 0.41563571, 0.41937218],
[0.28229119, 0.46010724, 0.47566444, 0.33476802, 0.26635363],
[0.72239851, 0.62075004, 0.66951982, 0.01827502, 0.43831829]],
[[0.01659589, 0.15320541, 0.6834928 , 0.28457133, 0.6275462 ],
[0.79025065, 0.28041334, 0.98545051, 0.92813944, 0.14360307],
[0.56024173, 0.49317082, 0.90980491, 0.30169504, 0.96630809],
[0.28229119, 0.46010724, 0.4678913 , 0.53028241, 0.03112231],
[0.72239851, 0.62075004, 0.02798567, 0.36880876, 0.55029074]],
[[0.01659589, 0.15320541, 0.49744358, 0.25131814, 0.56422852],
[0.79025065, 0.28041334, 0.85454081, 0.8254149 , 0.5637401 ],
[0.56024173, 0.49317082, 0.06713389, 0.64357544, 0.12901734],
[0.28229119, 0.46010724, 0.68445959, 0.07113376, 0.86651669],
[0.72239851, 0.62075004, 0.40127051, 0.6311474 , 0.51015882]]])
A variation on concatenating is indexed assignment:
For the first example:
In [245]: a=np.arange(8).reshape(2,2,2); b=np.array([100,200])
In [246]: c = np.zeros((2,2,3), a.dtype)
In [247]: c[:,:,0]=b
In [248]: c[:,:,1:]=a.transpose(1,0,2)
In [249]: c
Out[249]:
array([[[100, 0, 1],
[200, 4, 5]],
[[100, 2, 3],
[200, 6, 7]]])
And for the second:
In [250]: a1 = np.arange(5*3*3).reshape(5,3,3)
In [251]: b1 = np.arange(10).reshape(5,2)
In [252]: c1 = np.zeros((3,5,5),a.dtype)
In [253]: c1[:,:,:2]=b1
In [254]: c1[:,:,2:]=a1.transpose(1,0,2)
In [255]: c1
Out[255]:
array([[[ 0, 1, 0, 1, 2],
[ 2, 3, 9, 10, 11],
[ 4, 5, 18, 19, 20],
[ 6, 7, 27, 28, 29],
[ 8, 9, 36, 37, 38]],
[[ 0, 1, 3, 4, 5],
[ 2, 3, 12, 13, 14],
[ 4, 5, 21, 22, 23],
[ 6, 7, 30, 31, 32],
[ 8, 9, 39, 40, 41]],
[[ 0, 1, 6, 7, 8],
[ 2, 3, 15, 16, 17],
[ 4, 5, 24, 25, 26],
[ 6, 7, 33, 34, 35],
[ 8, 9, 42, 43, 44]]])
Deriving the shape of c from a and b is left as an exercise for the reader. :)
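For readers who would rather not do the exercise, here is one possible sketch of the shape derivation, assuming `a` has shape (n, m, p) and `b` has shape (n,) or (n, k). The function name `prepend_columns` is made up for this illustration.

```python
import numpy as np

def prepend_columns(a, b):
    """Sketch: place b's values in front of each row of a, swapping a's first two axes."""
    b2 = b.reshape(len(b), -1)        # (n,) -> (n, 1); (n, k) stays (n, k)
    k = b2.shape[1]
    n, m, p = a.shape
    out = np.empty((m, n, k + p), dtype=np.result_type(a, b))
    out[:, :, :k] = b2                # broadcast over the new leading axis
    out[:, :, k:] = a.transpose(1, 0, 2)
    return out

a = np.arange(8).reshape(2, 2, 2)
b = np.array([100, 200])
c = prepend_columns(a, b)
print(c)
```

The output shape is (m, n, k + p): the first two axes of `a` are swapped and the last axis grows by the number of columns in `b`.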
np.stack (or np.array) over the iteration on 2nd axis is effectively a partial transpose (or interchange of the first 2 axes):
In [261]: np.stack([a[:,i,:] for i in range(a.shape[1])])
Out[261]:
array([[[0, 1],
[4, 5]],
[[2, 3],
[6, 7]]])
In [262]: a.transpose(1,0,2)
Out[262]:
array([[[0, 1],
[4, 5]],
[[2, 3],
[6, 7]]])
We could also iterate on the first axis, and join on the second with:
In [263]: np.stack(a, axis=1)
Out[263]:
array([[[0, 1],
[4, 5]],
[[2, 3],
[6, 7]]])
A refinement on Ankit's answer using concatenate is:
np.concatenate([np.repeat(b[None,:,None], 2, axis=0), a.transpose(1,0,2)], axis=2)
np.concatenate([np.repeat(b1[None,:,:], 3, axis=0), a1.transpose(1,0,2)], axis=2)
The below code worked for me.
>>> a= np.random.rand(2,2,2); a
array([[[0.52706506, 0.48344319],
[0.79027196, 0.90581149]],
[[0.25930158, 0.59498346],
[0.02164495, 0.63081622]]])
>>> b=np.random.rand(2); b
array([0.96890722, 0.93670425])
>>> a1 = a.transpose(1, 0, 2); a1
array([[[0.52706506, 0.48344319],
[0.25930158, 0.59498346]],
[[0.79027196, 0.90581149],
[0.02164495, 0.63081622]]])
>>> c = np.tile(b, (2, 1)); c
array([[0.96890722, 0.93670425],
[0.96890722, 0.93670425]])
>>> c = np.expand_dims(c,2); c
array([[[0.96890722],
[0.93670425]],
[[0.96890722],
[0.93670425]]])
>>> np.concatenate((c, a1), axis=2)
array([[[0.96890722, 0.52706506, 0.48344319],
[0.93670425, 0.25930158, 0.59498346]],
[[0.96890722, 0.79027196, 0.90581149],
[0.93670425, 0.02164495, 0.63081622]]])
Here I first used tile to repeat b along a new leading dimension, as many times as the size of the 2nd dimension of a.
Then I used concatenate to join b and a along the last axis.
For the 2nd example:
>>> a= np.random.rand(5,3,3)
>>> a1 = a.transpose(1, 0, 2)
>>> b=np.random.rand(5, 2)
>>> c = np.tile(b, (3, 1, 1))
>>> np.concatenate((c, a1), axis=2)
Alright, here is the given data:
There are three numpy arrays of the shapes:
(i, 4, 2), (i, 4, 3), (i, 4, 2)
The size i is shared among them but varies.
The dtype is float32 for everything.
The goal is to interweave them in a particular order. Let's look at the data at index 0 for these arrays:
[[-208. -16.]
[-192. -16.]
[-192. 0.]
[-208. 0.]]
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
[[ 0.49609375 0.984375 ]
[ 0.25390625 0.984375 ]
[ 0.25390625 0.015625 ]
[ 0.49609375 0.015625 ]]
In this case, the concatenated target array would look something like this:
[-208, -16, 1, 1, 1, 0.496, 0.984, -192, -16, 1, 1, 1, ...]
And then continue on with index 1.
I don't know how to achieve this, as the concatenate function just keeps telling me that the shapes don't match. The shape of the target array does not matter much, just that the memoryview of it must be in the given order for upload to a gpu shader.
Edit: I could achieve this with a few python for loops, but the performance impact would be a problem in this program.
Use np.dstack and flatten with np.ravel() -
np.dstack((a,b,c)).ravel()
Now, np.dstack is basically stacking along the third axis. So, alternatively we can use np.concatenate too along that axis, like so -
np.concatenate((a,b,c),axis=2).ravel()
Sample run -
1) Setup Input arrays :
In [613]: np.random.seed(1234)
...: n = 3
...: m = 2
...: a = np.random.randint(0,9,(n,m,2))
...: b = np.random.randint(11,99,(n,m,2))
...: c = np.random.randint(101,999,(n,m,2))
...:
2) Check input values :
In [614]: a
Out[614]:
array([[[3, 6],
[5, 4]],
[[8, 1],
[7, 6]],
[[8, 0],
[5, 0]]])
In [615]: b
Out[615]:
array([[[84, 58],
[61, 87]],
[[48, 45],
[49, 78]],
[[22, 11],
[86, 91]]])
In [616]: c
Out[616]:
array([[[104, 359],
[376, 560]],
[[472, 720],
[566, 115]],
[[344, 556],
[929, 591]]])
3) Output :
In [617]: np.dstack((a,b,c)).ravel()
Out[617]:
array([ 3, 6, 84, 58, 104, 359, 5, 4, 61, 87, 376, 560, 8,
1, 48, 45, 472, 720, 7, 6, 49, 78, 566, 115, 8, 0,
22, 11, 344, 556, 5, 0, 86, 91, 929, 591])
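The sample run uses three arrays with the same last dimension. As a quick check that the same call also handles the shapes from the question, here is a sketch with (i, 4, 2), (i, 4, 3) and (i, 4, 2) float32 arrays. The names `pos`, `col` and `uv` and the fill values are placeholders of mine, not the asker's data.

```python
import numpy as np

i = 2  # hypothetical number of quads
pos = np.arange(i * 4 * 2, dtype=np.float32).reshape(i, 4, 2)
col = np.ones((i, 4, 3), dtype=np.float32)
uv = np.full((i, 4, 2), 0.5, dtype=np.float32)

# Only the last axis differs, so joining along axis 2 is valid.
interleaved = np.concatenate((pos, col, uv), axis=2)   # shape (i, 4, 7)
flat = interleaved.ravel()                             # shape (i * 4 * 7,)

print(flat[:7])   # the first vertex: 2 position values, 3 colour values, 2 uv values
```

Since all inputs are float32, the result stays float32 and is C-contiguous, which should suit a buffer upload.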
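What I would do is:
np.hstack([a.reshape(-1, 2), b.reshape(-1, 3), c.reshape(-1, 2)]).flatten()
assuming a, b, c are the three arrays. Each array is reshaped to 2-D first, because np.hstack joins 3-D arrays along axis 1, where the shapes (i, 4, 2) and (i, 4, 3) do not line up.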
I have a numpy array with shape (3, 600219), which is a list of indices.
i.e.
array([[ 0, 0, 0, ..., 2879, 2879, 2879],
[ 40, 40, 40, ..., 162, 165, 168],
[ 249, 250, 251, ..., 195, 196, 198]])
The first row holds time indices; the second and third rows hold the indices of the coordinates. I am trying to figure out which pair of coordinates occurred most frequently, disregarding the time.
e.g. Was it (40,249) or (40,250)...etc.?
I just used a small sample of your data, but I think you'll get the point:
import numpy as np
array = np.array([[ 0, 0, 0, 2879, 2879, 2879],
[ 40, 40, 40, 162, 165, 168],
[ 249, 250, 251, 195, 196, 198]])
# Zip together only the second and third rows
only_coords = zip(array[1,:], array[2,:])
from collections import Counter
Counter(only_coords).most_common()
Produces:
Out[11]:
[((40, 249), 1),
((165, 196), 1),
((162, 195), 1),
((168, 198), 1),
((40, 251), 1),
((40, 250), 1)]
Here's one vectorized approach -
IDs = a[1] * (a[2].max() + 1) + a[2]
unq, idx, count = np.unique(IDs, return_index=True, return_counts=True)
out = a[1:, idx[count.argmax()]]
If there could be negative coordinates, use a[1] * (a[2].max() - a[2].min() + 1) + (a[2] - a[2].min()) to compute IDs.
Sample run -
In [44]: a
Out[44]:
array([[8, 3, 6, 6, 8, 5, 1, 6, 6, 5],
[5, 2, 1, 1, 5, 1, 5, 1, 1, 4],
[8, 2, 3, 3, 8, 1, 7, 3, 3, 3]])
In [47]: IDs = a[1] * (a[2].max() + 1) + a[2]
In [48]: unq, idx, count = np.unique(IDs, return_index=1,return_counts=1)
In [49]: a[1:,idx[count.argmax()]]
Out[49]: array([1, 3])
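An alternative sketch that skips the ID encoding: `np.unique` accepts an `axis` argument (NumPy 1.13 and later, as far as I know), so the coordinate columns can be counted directly. This is a different technique from the ID trick above, run on the same sample array.

```python
import numpy as np

a = np.array([[8, 3, 6, 6, 8, 5, 1, 6, 6, 5],
              [5, 2, 1, 1, 5, 1, 5, 1, 1, 4],
              [8, 2, 3, 3, 8, 1, 7, 3, 3, 3]])

# Treat each column of the last two rows as one (row, col) pair.
pairs, counts = np.unique(a[1:], axis=1, return_counts=True)
most_common = pairs[:, counts.argmax()]

print(most_common, counts.max())
```

On large inputs the ID-based version may be faster, since it sorts a 1-D array rather than columns, but I have not timed the two.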
This might seem a little abstract, but you could try storing each coordinate pair as a single number, e.g. [2,1] = 2.1, and put your data into a list of these numbers (this is only unambiguous while the second coordinate is a single nonzero digit). For example, a 2nd row of [1,1,2] and a 3rd row of [2,2,1] would give [1.2, 1.2, 2.1]. You could then use the code:
from collections import Counter
list1=[1.2,1.2,2.1]
data = Counter(list1)
print (data.most_common(1)) # Returns the highest occurring item
which prints the most common number and how many times it occurs. You can then convert the number back to a coordinate pair if you need to use it in your code.
Here is a sample code that does the count:
import numpy as np
import collections
a = np.array([[0, 1, 2, 3], [10, 10, 30 ,40], [25, 25, 10, 50]])
# You don't care about time
b = np.transpose(a[1:])
# convert list items to tuples
c = map(lambda v:tuple(v), b)
collections.Counter(c)
The output:
Counter({(10, 25): 2, (30, 10): 1, (40, 50): 1})
I encoded my categorical data using sklearn.OneHotEncoder and fed them to a random forest classifier. Everything seems to work and I got my predicted output back.
Is there a way to reverse the encoding and convert my output back to its original state?
A good systematic way to figure this out is to start with some test data and work through the sklearn.OneHotEncoder source with it. If you don't much care about how it works and simply want a quick answer, skip to the bottom.
X = np.array([
[3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
[5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T
n_values_
Lines 1763-1786 determine the n_values_ parameter. This will be determined automatically if you set n_values='auto' (the default). Alternatively you can specify a maximum value for all features (int) or a maximum value per feature (array). Let's assume that we're using the default. So the following lines execute:
n_samples, n_features = X.shape # 10, 2
n_values = np.max(X, axis=0) + 1 # [100, 21]
self.n_values_ = n_values
feature_indices_
Next the feature_indices_ parameter is calculated.
n_values = np.hstack([[0], n_values]) # [0, 100, 21]
indices = np.cumsum(n_values) # [0, 100, 121]
self.feature_indices_ = indices
So feature_indices_ is merely the cumulative sum of n_values_ with a 0 prepended.
Sparse Matrix Construction
Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.
column_indices = (X + indices[:-1]).ravel()
# array([ 3, 105, 10, 101, 15, 103, 33, 107, 54, 108, 55, 112, 78, 115, 79, 119, 80, 120, 99, 108])
row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)
data = np.ones(n_samples * n_features)
# array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
out = sparse.coo_matrix((data, (row_indices, column_indices)),
shape=(n_samples, indices[-1]),
dtype=self.dtype).tocsr()
# <10x121 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."
active_features_
Now, if n_values='auto', the sparse csr matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True, otherwise it is densified before returning.
if self.n_values == 'auto':
    mask = np.array(out.sum(axis=0)).ravel() != 0
    active_features = np.where(mask)[0] # array([ 3, 10, 15, 33, 54, 55, 78, 79, 80, 99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
    out = out[:, active_features] # <10x19 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
    self.active_features_ = active_features
return out if self.sparse else out.toarray()
Decoding
Now let's work in reverse. We'd like to know how to recover X given the sparse matrix that is returned along with the OneHotEncoder features detailed above. Let's assume we actually ran the code above by instantiating a new OneHotEncoder and running fit_transform on our data X.
from sklearn import preprocessing
ohc = preprocessing.OneHotEncoder() # all default params
out = ohc.fit_transform(X)
The key insight to solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column numbers for each data point. However, these column numbers are not guaranteed to be sorted. To sort them, we can use the sorted_indices method.
out.indices # array([12, 0, 10, 1, 11, 2, 13, 3, 14, 4, 15, 5, 16, 6, 17, 7, 18, 8, 14, 9], dtype=int32)
out = out.sorted_indices()
out.indices # array([ 0, 12, 1, 10, 2, 11, 3, 13, 4, 14, 5, 15, 6, 16, 7, 17, 8, 18, 9, 14], dtype=int32)
We can see that before sorting, the indices are actually reversed within each row. In other words, they are ordered with the last column first and the first column last. This is evident from the first two elements: [12, 0]. The 0 corresponds to the 3 in the first column of X: since 3 is the minimum element, it was assigned to the first active column. The 12 corresponds to the 5 in the second column of X. Since the first column of X occupies 10 distinct encoded columns, the minimum element of the second column (1) gets index 10. The next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.
Next we look at active_features_:
ohc.active_features_ # array([ 3, 10, 15, 33, 54, 55, 78, 79, 80, 99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
Notice that there are 19 elements, which corresponds to the number of distinct elements in our data (one element, 8, was repeated once). Notice also that these are arranged in order. The features that were in the first column of X are the same, and the features in the second column have simply been summed with 100, which corresponds to ohc.feature_indices_[1].
Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about the relationship here shows that the positions in ohc.active_features_ correspond to the column numbers in out.indices. With this, we can decode:
import numpy as np
decode_columns = np.vectorize(lambda col: ohc.active_features_[col])
decoded = decode_columns(out.indices).reshape(X.shape)
This gives us:
array([[ 3, 105],
[ 10, 101],
[ 15, 103],
[ 33, 107],
[ 54, 108],
[ 55, 112],
[ 78, 115],
[ 79, 119],
[ 80, 120],
[ 99, 108]])
And we can get back to the original feature values by subtracting off the offsets from ohc.feature_indices_:
recovered_X = decoded - ohc.feature_indices_[:-1]
array([[ 3, 5],
[10, 1],
[15, 3],
[33, 7],
[54, 8],
[55, 12],
[78, 15],
[79, 19],
[80, 20],
[99, 8]])
Note that you will need to have the original shape of X, which is simply (n_samples, n_features).
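The whole round trip can also be reproduced without scikit-learn, which may help if your installed version no longer exposes these attributes. The following is only a NumPy sketch of the steps described above, not the library's code; `offsets`, `active` and `encoded` are my stand-ins for feature_indices_, active_features_ and the dense output.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = X.shape

# Encoding: offset each feature into its own block of columns.
n_values = X.max(axis=0) + 1                            # [100, 21]
offsets = np.concatenate(([0], np.cumsum(n_values)))    # [0, 100, 121]
dense = np.zeros((n_samples, offsets[-1]), dtype=int)
dense[np.arange(n_samples)[:, None], X + offsets[:-1]] = 1

# Keep only the columns that are actually used.
active = np.flatnonzero(dense.sum(axis=0))
encoded = dense[:, active]                              # shape (10, 19)

# Decoding: np.nonzero walks the matrix row by row with ascending columns,
# so each row yields feature 0 first, then feature 1.
_, compact_cols = np.nonzero(encoded)
recovered_X = active[compact_cols].reshape(X.shape) - offsets[:-1]

print(np.array_equal(recovered_X, X))
```

Because `np.nonzero` already returns column indices in ascending order within each row, the sorting step needed for the CSR indices does not come up here.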
TL;DR
Given the sklearn.OneHotEncoder instance called ohc, the encoded data (scipy.sparse.csr_matrix) output from ohc.fit_transform or ohc.transform called out, and the shape of the original data (n_samples, n_features), recover the original data X with:
recovered_X = (np.array([ohc.active_features_[col] for col in out.sorted_indices().indices])
               .reshape(n_samples, n_features) - ohc.feature_indices_[:-1])
Just compute the dot product of the encoded values with ohe.active_features_. It works for both sparse and dense representations. Example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])
ohe = OneHotEncoder()
encoded = ohe.fit_transform(orig.reshape(-1, 1)) # input needs to be column-wise
decoded = encoded.dot(ohe.active_features_).astype(int)
assert np.allclose(orig, decoded)
The key insight is that the active_features_ attribute of the OHE model represents the original values for each binary column. Thus we can decode the binary-encoded number by simply computing a dot product with active_features_. For each data point there's just a single 1, at the position of the original value.
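If active_features_ is not available in your scikit-learn version, the same dot-product idea can be illustrated with plain NumPy. In this sketch `categories` plays the role of active_features_ and the one-hot matrix is built by hand, so it only covers a single numeric feature and says nothing about how the encoder works internally.

```python
import numpy as np

orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

# Sorted distinct values, plus the position of each sample's value among them.
categories, codes = np.unique(orig, return_inverse=True)

# Hand-built one-hot matrix: row i has a single 1 in column codes[i].
one_hot = np.eye(len(categories), dtype=int)[codes]

# Each row picks out exactly one entry of `categories`.
decoded = one_hot.dot(categories)

print(decoded)
```

The dot product works because every row contains exactly one 1, so the sum collapses to the single matching category value.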
Use numpy.argmax() with axis = 1.
Example:
ohe_encoded = np.array([[0, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 0]])
ohe_encoded
> array([[0, 0, 1],
[0, 1, 0],
[0, 1, 0],
[1, 0, 0]])
np.argmax(ohe_encoded, axis = 1)
> array([2, 1, 1, 0], dtype=int64)
Since version 0.20 of scikit-learn, the active_features_ attribute of the OneHotEncoder class has been deprecated, so I suggest relying on the categories_ attribute instead.
The below function can help you recover the original data from a matrix that has been one-hot encoded:
import itertools

def reverse_one_hot(X, y, encoder):
    # One dict per sample, filled in as the nonzero entries are visited
    reversed_data = [{} for _ in range(len(y))]
    all_categories = list(itertools.chain(*encoder.categories_))
    category_names = ['category_{}'.format(i+1) for i in range(len(encoder.categories_))]
    category_lengths = [len(encoder.categories_[i]) for i in range(len(encoder.categories_))]
    for row_index, feature_index in zip(*X.nonzero()):
        category_value = all_categories[feature_index]
        category_name = get_category_name(feature_index, category_names, category_lengths)
        reversed_data[row_index][category_name] = category_value
        reversed_data[row_index]['target'] = y[row_index]
    return reversed_data

def get_category_name(index, names, lengths):
    # Find which original feature an encoded column index belongs to
    counter = 0
    for i in range(len(lengths)):
        counter += lengths[i]
        if index < counter:
            return names[i]
    raise ValueError('The index is higher than the number of categorical values')
To test it, I have created a small data set that includes the ratings that users have given to movies:
data = [
{'user_id': 'John', 'item_id': 'The Matrix', 'rating': 5},
{'user_id': 'John', 'item_id': 'Titanic', 'rating': 1},
{'user_id': 'John', 'item_id': 'Forrest Gump', 'rating': 2},
{'user_id': 'John', 'item_id': 'Wall-E', 'rating': 2},
{'user_id': 'Lucy', 'item_id': 'The Matrix', 'rating': 5},
{'user_id': 'Lucy', 'item_id': 'Titanic', 'rating': 1},
{'user_id': 'Lucy', 'item_id': 'Die Hard', 'rating': 5},
{'user_id': 'Lucy', 'item_id': 'Forrest Gump', 'rating': 2},
{'user_id': 'Lucy', 'item_id': 'Wall-E', 'rating': 2},
{'user_id': 'Eric', 'item_id': 'The Matrix', 'rating': 2},
{'user_id': 'Eric', 'item_id': 'Die Hard', 'rating': 3},
{'user_id': 'Eric', 'item_id': 'Forrest Gump', 'rating': 5},
{'user_id': 'Eric', 'item_id': 'Wall-E', 'rating': 4},
{'user_id': 'Diane', 'item_id': 'The Matrix', 'rating': 4},
{'user_id': 'Diane', 'item_id': 'Titanic', 'rating': 3},
{'user_id': 'Diane', 'item_id': 'Die Hard', 'rating': 5},
{'user_id': 'Diane', 'item_id': 'Forrest Gump', 'rating': 3},
]
data_frame = pandas.DataFrame(data)
data_frame = data_frame[['user_id', 'item_id', 'rating']]
If we are building a prediction model, we have to remember to delete the dependent variable (in this case the rating) from the DataFrame before we encode it.
ratings = data_frame['rating']
data_frame.drop(columns=['rating'], inplace=True)
Then we proceed to do the encoding
ohc = OneHotEncoder()
encoded_data = ohc.fit_transform(data_frame)
print(encoded_data)
Which results in:
(0, 2) 1.0
(0, 6) 1.0
(1, 2) 1.0
(1, 7) 1.0
(2, 2) 1.0
(2, 5) 1.0
(3, 2) 1.0
(3, 8) 1.0
(4, 3) 1.0
(4, 6) 1.0
(5, 3) 1.0
(5, 7) 1.0
(6, 3) 1.0
(6, 4) 1.0
(7, 3) 1.0
(7, 5) 1.0
(8, 3) 1.0
(8, 8) 1.0
(9, 1) 1.0
(9, 6) 1.0
(10, 1) 1.0
(10, 4) 1.0
(11, 1) 1.0
(11, 5) 1.0
(12, 1) 1.0
(12, 8) 1.0
(13, 0) 1.0
(13, 6) 1.0
(14, 0) 1.0
(14, 7) 1.0
(15, 0) 1.0
(15, 4) 1.0
(16, 0) 1.0
(16, 5) 1.0
After encoding, we can reverse it using the reverse_one_hot function we defined above, like this:
reverse_data = reverse_one_hot(encoded_data, ratings, ohc)
print(pandas.DataFrame(reverse_data))
Which gives us:
category_1 category_2 target
0 John The Matrix 5
1 John Titanic 1
2 John Forrest Gump 2
3 John Wall-E 2
4 Lucy The Matrix 5
5 Lucy Titanic 1
6 Lucy Die Hard 5
7 Lucy Forrest Gump 2
8 Lucy Wall-E 2
9 Eric The Matrix 2
10 Eric Die Hard 3
11 Eric Forrest Gump 5
12 Eric Wall-E 4
13 Diane The Matrix 4
14 Diane Titanic 3
15 Diane Die Hard 5
16 Diane Forrest Gump 3
If the features are dense integers, like [1,2,4,5,6], with only a few numbers missing, we can map them directly to their corresponding positions.
>>> import numpy as np
>>> from scipy import sparse
>>> def _sparse_binary(y):
...     # one-hot codes of y with scipy.sparse matrix.
...     row = np.arange(len(y))
...     col = y - y.min()
...     data = np.ones(len(y))
...     return sparse.csr_matrix((data, (row, col)))
...
>>> y = np.random.randint(-2,2, 8).reshape([4,2])
>>> y
array([[ 0, -2],
[-2, 1],
[ 1, 0],
[ 0, -2]])
>>> yc = [_sparse_binary(y[:,i]) for i in range(2)]
>>> for i in yc: print(i.todense())
...
[[ 0. 0. 1. 0.]
[ 1. 0. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 1. 0.]]
[[ 1. 0. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 1. 0.]
[ 1. 0. 0. 0.]]
>>> [i.shape for i in yc]
[(4, 4), (4, 4)]
This is a simple compromise, but it works and is easy to reverse with argmax(), e.g.:
>>> np.argmax(yc[0].todense(), 1) + y.min(0)[0]
matrix([[ 0],
[-2],
[ 1],
[ 0]])
How to one-hot encode
See https://stackoverflow.com/a/42874726/562769
import numpy as np
nb_classes = 6
orig_data = [[2, 3, 4, 0]]
def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]
How to reverse
def one_hot_to_indices(data):
    indices = []
    for el in data:
        indices.append(list(el).index(1))
    return indices
hot = indices_to_one_hot(orig_data, nb_classes)
indices = one_hot_to_indices(hot)
print(orig_data)
print(indices)
gives:
[[2, 3, 4, 0]]
[2, 3, 4, 0]
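The reverse step can also be done without the Python loop. This is a small sketch using `np.argmax` along axis 1, which returns the position of the single 1 in each row; it assumes every row really is one-hot.

```python
import numpy as np

nb_classes = 6
orig_data = [[2, 3, 4, 0]]

# Same encoding as indices_to_one_hot above.
hot = np.eye(nb_classes)[np.array(orig_data).reshape(-1)]

# Vectorized decoding: index of the maximum (the 1) in each row.
indices = np.argmax(hot, axis=1)

print(indices.tolist())
```

Unlike `list(el).index(1)`, `argmax` does not raise on a row without a 1 (it returns 0), so validate the input first if that can happen.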
The short answer is "no". The encoder takes your categorical data and automagically transforms it to a reasonable set of numbers.
The longer answer is "not automatically". If you provide an explicit mapping using the n_values parameter, though, you can probably implement your own decoding at the other side. See the documentation for some hints on how that might be done.
That said, this is a fairly strange question. You may want to use a DictVectorizer instead.
Pandas approach:
pd.get_dummies converts categorical variables to binary variables. To convert them back, find the column label where each row holds a 1 using pd.Series.idxmax(). You can then map those labels back to the original values with a list (indexed in the order of the original data) or a dictionary.
import pandas as pd
import numpy as np
col = np.random.randint(1,5,20)
df = pd.DataFrame({'A': col})
df.head()
A
0 2
1 2
2 1
3 1
4 3
df_dum = pd.get_dummies(df['A'])
df_dum.head()
1 2 3 4
0 0 1 0 0
1 0 1 0 0
2 1 0 0 0
3 1 0 0 0
4 0 0 1 0
df_n = df_dum.apply(lambda x: x.idxmax(), axis = 1)
df_n.head()
0 2
1 2
2 1
3 1
4 3