I have a pandas DataFrame with all my features and labels. One of my features is categorical and needs to be one-hot encoded.
The feature is an integer and can only take values from 0 to 4.
To save those arrays back into my DataFrame, I use the following code:
# enc is my OneHotEncoder object
df['mycol'] = df['mycol'].map(lambda x: enc.transform(x).toarray())
My DataFrame has more than 1 million rows, so the above code takes a while. Is there a faster way to assign the arrays to the DataFrame cells? Because I have just 5 categories, I don't need to call the transform() function 1 million times.
I already tried something like:
num_categories = 5
i = 0
while i < num_categories:
    df.loc[df['mycol'] == i, 'mycol'] = enc.transform(i).toarray()
    i += 1
which yields this error:
ValueError: Must have equal len keys and value when setting with an ndarray
You can use pd.get_dummies:
>>> import pandas as pd
>>> s = pd.Series(['a', 'b', 'c', 'a'])
>>> s
0    a
1    b
2    c
3    a
dtype: object
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
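Applied to the question's setup, a minimal sketch (assuming df is the original DataFrame and df['mycol'] holds the integer categories 0-4) encodes the whole column in one vectorized call and joins the result back:
import pandas as pd

# one-hot encode the entire column at once, then attach the new columns
dummies = pd.get_dummies(df['mycol'], prefix='mycol')
df = df.join(dummies)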
Alternatively:
>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> a = np.array([1, 1, 3, 2, 2]).reshape(-1, 1)
>>> a
array([[1],
       [1],
       [3],
       [2],
       [2]])
>>> one_hot = enc.fit_transform(a)
>>> one_hot.toarray()
array([[1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.]])
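To answer the speed question directly: since there are only five categories, a minimal sketch (assuming enc is the already-fitted OneHotEncoder from the question) calls transform() once per category, builds a lookup table, and maps it over the column instead of transforming every row:
import numpy as np

# one transform call per category instead of one per row
categories = np.arange(5).reshape(-1, 1)
one_hot_rows = enc.transform(categories).toarray()
lookup = dict(zip(range(5), one_hot_rows))
df['mycol'] = df['mycol'].map(lookup)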
I'm trying to preprocess data with the OneHotEncoder of scikit-learn. Obviously, I'm doing something wrong. Here is my sample program:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
cat = ['ok', 'ko', 'maybe', 'maybe']
label_encoder = LabelEncoder()
label_encoder.fit(cat)
cat = label_encoder.transform(cat)
# returns [2 0 1 1], which seems good.
print(cat)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
res = ct.fit_transform([cat])
print(res)
Final result: [[1.0 0 1 1]]
Expected result: something like:
[
  [ 1 0 0 ]
  [ 0 0 1 ]
  [ 0 1 0 ]
  [ 0 1 0 ]
]
Can someone point out what I'm missing?
You can consider using numpy and MultiLabelBinarizer.
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
cat = np.array([['ok', 'ko', 'maybe', 'maybe']])
m = MultiLabelBinarizer()
print(m.fit_transform(cat.T))
If you still want to stick with your solution, you just need to update it as follows:
# res = ct.fit_transform([cat])  # remove this: [cat] is still a row, not a column
res = ct.fit_transform(np.array([cat]).T)  # reshaped as a column, this should work
Out[2]:
array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 1., 0.]])
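As a side note, recent scikit-learn versions (0.20+) let OneHotEncoder handle string categories directly, so the LabelEncoder step can be dropped entirely; a minimal sketch under that assumption:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# one column, one sample per row
cat = np.array(['ok', 'ko', 'maybe', 'maybe']).reshape(-1, 1)
enc = OneHotEncoder()
print(enc.fit_transform(cat).toarray())  # columns are sorted alphabetically: ko, maybe, ok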
Given a 2D array, I'm looking for a pythonic way to get an array of the same shape, keeping only the maximum element of each row.
See the max_row_filter function below:
def max_row_filter(mat2d):
    m = np.zeros(mat2d.shape)
    for r in range(mat2d.shape[0]):
        c = np.argmax(mat2d[r])
        m[r, c] = mat2d[r, c]
    return m
p = np.array([[1, 2, 3], [5, 4, 3], [9, 10, 3]])
max_row_filter(p)
Out: array([[ 0.,  0.,  3.],
            [ 5.,  0.,  0.],
            [ 0., 10.,  0.]])
I'm looking for an efficient way to do this, suitable to be done on big arrays.
Alternative answer (this will keep duplicates):
p * (p==p.max(axis=1, keepdims=True))
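For instance, with a hypothetical row containing a tied maximum, both occurrences survive:
import numpy as np

q = np.array([[3, 1, 3],
              [0, 2, 1]])
print(q * (q == q.max(axis=1, keepdims=True)))
# [[3 0 3]
#  [0 2 0]]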
If there are no duplicates, you could use numpy.argmax:
import numpy as np
p = np.array([[1, 2, 3],
              [5, 4, 3],
              [9, 10, 3]])
result = np.zeros_like(p)
rows, cols = zip(*enumerate(np.argmax(p, axis=1)))
result[rows, cols] = p[rows, cols]
print(result)
Output
[[ 0 0 3]
[ 5 0 0]
[ 0 10 0]]
Note that, for multiple occurrences, argmax returns the first occurrence.
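An equivalent sketch that avoids the zip/enumerate dance, using a plain row-index array (same assumptions as above):
import numpy as np

p = np.array([[1, 2, 3], [5, 4, 3], [9, 10, 3]])
result = np.zeros_like(p)
rows = np.arange(p.shape[0])   # 0, 1, 2: one index per row
cols = np.argmax(p, axis=1)    # column of each row's maximum
result[rows, cols] = p[rows, cols]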
I have the following dictionary results_dict in Python 3.2, where each key is a string and each value is a list of 3 arrays. Each array has 400 float values. I want to convert this dictionary into a data structure that can be used in MATLAB 2017b. However, if I execute the following:
savemat('GridCellResults.mat', results_dict, oned_as='row');
The command executes successfully, but MATLAB is not able to understand the matrix file. For this reason, I wrote the code below to convert the dictionary into a 3-dimensional matrix (X, Y, Z), where X is the size of each array (400 elements), Y is the number of arrays per dictionary key (3 arrays), and Z is the number of keys in the dictionary. However, when I execute it I get the following error:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
Here is the code. Any clue why I am getting this error? Even if I try without the transpose function, I keep getting the same error.
import numpy as np
CARDINALITY = 400 # Number of angular domain values.
NUM_COLUMNS = 3
NUM_CELLS = 114
matlab_array = np.zeros((CARDINALITY, NUM_COLUMNS, NUM_CELLS))
for key, value in results_dict.items():
    matlab_array[:, 0, key] = np.transpose(value[0])
    matlab_array[:, 1, key] = np.transpose(value[1])
    matlab_array[:, 2, key] = np.transpose(value[2])
Trying to follow your description, I can successfully write and read such a dictionary.
In an ipython session:
In [48]: from scipy.io import savemat, loadmat
In [49]: adict = {'a':[np.arange(3),np.ones(3),np.array([4,2,1])]}
In [50]: adict['b'] = [np.arange(3),np.ones(3),np.array([4,2,1])]
In [51]: adict
Out[51]:
{'a': [array([0, 1, 2]), array([1., 1., 1.]), array([4, 2, 1])],
'b': [array([0, 1, 2]), array([1., 1., 1.]), array([4, 2, 1])]}
In [52]: pwd
Out[52]: '/home/paul/mypy'
In [53]: savemat('stack48385062.mat',adict, oned_as='row')
In [54]: data = loadmat('stack48385062.mat')
In [55]: data
Out[55]:
{'__globals__': [],
'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Mon Jan 22 09:15:31 2018',
'__version__': '1.0',
 'a': array([[0., 1., 2.],
             [1., 1., 1.],
             [4., 2., 1.]]),
 'b': array([[0., 1., 2.],
             [1., 1., 1.],
             [4., 2., 1.]])}
The lists of arrays (of constant size) were converted to 2d arrays.
In an Octave session:
>> load stack48385062.mat
>> a
a =
0 1 2
1 1 1
4 2 1
>> b
b =
0 1 2
1 1 1
4 2 1
>>
Or creating your 3d array (using a numeric index rather than string key):
In [56]: M=np.zeros([3, 3, 2])
In [57]: for i in range(len(adict)):
    ...:     for j in range(3):
    ...:         v = adict[list(adict.keys())[i]]
    ...:         M[:, j, i] = v[j]
    ...:
In [58]: M
Out[58]:
array([[[0., 0.],
        [1., 1.],
        [4., 4.]],

       [[1., 1.],
        [1., 1.],
        [2., 2.]],

       [[2., 2.],
        [1., 1.],
        [1., 1.]]])
>> load stack48385062_1.mat
>> M
M =
ans(:,:,1) =
0 1 4
1 1 2
2 1 1
ans(:,:,2) =
0 1 4
1 1 2
2 1 1
I should have made the initial dictionary with a list of three 4-element arrays, so it would be easier to track transpositions. MATLAB and numpy have different axis orders, which can be confusing; savemat tries to compensate.
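Following that hint about numeric indices, a minimal sketch of a fix for the original IndexError (the string key was being used as an array index; results_dict and the constants are taken from the question, and the 'results' key name is arbitrary):
import numpy as np
from scipy.io import savemat

CARDINALITY = 400
NUM_COLUMNS = 3

matlab_array = np.zeros((CARDINALITY, NUM_COLUMNS, len(results_dict)))
for z, (key, value) in enumerate(results_dict.items()):
    for j in range(NUM_COLUMNS):
        matlab_array[:, j, z] = value[j]  # z is an integer, unlike the string key

savemat('GridCellResults.mat', {'results': matlab_array}, oned_as='row')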
I am trying to assign new values to an array based on whether or not the stored value is less than 3. Coming from an R background, this is how I would do it, but it gives me a syntax error in Python. What am I doing wrong, and what is the Python approach?
eurx=[1,2,3,4,5,6,7,'a',8]
sma50=3
tw=eurx
tw[eurx<sma50]=-1
tw[eurx>=sma50]=1
tw[(tw!=1)||(tw!=-1)]=0
print(tw)
GOAL:
-1
-1
1
1
1
1
1
0
1
This is "too much R". A pythonic way would be to use functional filtering:
>>> list(map(lambda i: -2*int(i < sma50) + 1 if type(i) == int else 0, eurx))
[-1, -1, 1, 1, 1, 1, 1, 0, 1]
Or just a simple for-loop with a few ifs:
>>> for i in eurx:
...     if type(i) != int:
...         print(0)
...     else:
...         print(-2*int(i < sma50) + 1)
...
-1
-1
1
1
1
1
1
0
1
In general: don't try to guess the syntax. It's very simple; just read through some tutorials (e.g. https://docs.python.org/3/tutorial/introduction.html#first-steps-towards-programming).
Edit: the int conversion hack works as follows: you know you can convert bool to int, right?
>>> int(True)
1
>>> int(False)
0
If i<sma50 evaluates to True, int(i<sma50) will be 1. So your numbers are converted to ones where i is smaller than sma50 and to zeros otherwise. But apparently you want the values (-1, 1) instead of (1, 0). Just apply the transform -2x+1 and you're done!
Your desired syntax is pretty close to what you'd write in numpy.
The heterogeneous list doesn't make it easy, but here's an example:
>>> import numpy as np
>>> eurx=[1,2,3,4,5,6,7,'a',8]
>>> sma50 = 3
>>> tw = np.array([i if isinstance(i, int) else np.nan for i in eurx])
>>> tw
array([ 1., 2., 3., 4., 5., 6., 7., nan, 8.])
>>> tw[tw < sma50] = -1
__main__:1: RuntimeWarning: invalid value encountered in less
>>> tw[tw >= sma50] = 1
__main__:1: RuntimeWarning: invalid value encountered in greater_equal
>>> tw
array([ -1., -1., 1., 1., 1., 1., 1., nan, 1.])
>>> tw[np.isnan(tw)] = 0
>>> tw
array([-1., -1., 1., 1., 1., 1., 1., 0., 1.])
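A more compact variant under the same NaN-placeholder assumption is numpy.where (the comparison still emits the same RuntimeWarning on the NaN entry):
import numpy as np

eurx = [1, 2, 3, 4, 5, 6, 7, 'a', 8]
sma50 = 3
tw = np.array([i if isinstance(i, int) else np.nan for i in eurx])
tw = np.where(np.isnan(tw), 0, np.where(tw < sma50, -1, 1))
print(tw)  # [-1 -1  1  1  1  1  1  0  1]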
I have a dataframe, df, with numbers, like so:
1 1 1
2 1 1
2 1 3
I'd like to deduct the median from each column so that the median of each becomes 0.
-1 0 0
0 0 0
0 0 2
How do I do this in a pythonic way? I'm guessing it is possible without iterating over the values, computing the median, and then deducting. I'd like to do it tersely, approximately like so:
from numpy import median
df -= median(df)  # does not work: deducts the median of the whole DataFrame
Just like this:
df -= df.median(axis=0)
numpy's median computes the median of the flattened data by default.
To accomplish this using numpy, pass axis=0 instead:
df -= median(df, axis=0)
For more detail, see the documentation: http://docs.scipy.org/doc/numpy/reference/generated/numpy.median.html
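For reference, a minimal self-contained sketch (assuming the 3x3 frame from the question):
import pandas as pd

df = pd.DataFrame([[1, 1, 1], [2, 1, 1], [2, 1, 3]])
df -= df.median(axis=0)  # subtract each column's median
print(df)
#      0    1    2
# 0 -1.0  0.0  0.0
# 1  0.0  0.0  0.0
# 2  0.0  0.0  2.0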
Some testing in ipython showed:
In [23]: A = numpy.arange(9)
In [24]: B = A.reshape((3,3))
In [25]: C = numpy.median(B,axis=0)
In [26]: D = B - C[None,:]
In [27]: B
Out[27]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [28]: D
Out[28]:
array([[-3., -3., -3.],
       [ 0.,  0.,  0.],
       [ 3.,  3.,  3.]])
In [29]: C
Out[29]: array([ 3., 4., 5.])
So the following line gets the median along the columns:
C = numpy.median(B,axis=0)
And this line subtracts it from the matrix, column by column:
D = B - C[None,:]
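As a closing note, NumPy broadcasting already aligns a 1-D array against the last axis, so the explicit C[None, :] is optional here; a quick check:
import numpy as np

B = np.arange(9).reshape(3, 3)
C = np.median(B, axis=0)
# broadcasting lines C up with B's columns either way
assert np.array_equal(B - C, B - C[None, :])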