How to transform this data for logistic regression?

I have 'y' and 'X' data:
y = [1, 0, 0, 0, 0, 0, 0, 0 ...] (this is fine for my purpose)
and
X = [['reg' '03b' '03e' 'buy']
['reg' '03b' '04e' 'sell']
['pref' '02b' '03e' 'sell']
['cur' '03b' '03e' 'buy']
['val' '03b' '03e' 'buy']
['reg' '03b' '03e' 'buy'] ...]
Column 0 takes one of the values 'reg'/'pref'/'cur'/'val'
Column 1: a string with the month number followed by 'b' (= begin)
Column 2: a string with the month number followed by 'e' (= end)
Column 3: 'buy' or 'sell'
But I can't do
logreg = LogisticRegression()
logreg.fit(X,y)
because X has the wrong structure (it is a list of lists of strings).
I tried to fix it with:
logreg = preprocessing.LabelEncoder()
i = 0
while i < len(X):
    logreg.fit(X[i])
    b[i] = logreg.transform(X[i])
    i = i + 1
But I get this:
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
...
[3 0 1 2]
All elements are the same. How can I correctly transform my data for .fit(X,y)?

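The problem is that you are mixing up rows and columns in X. Your loop fits the encoder on each row (one sample), so the codes only reflect the ordering of the values within that sample. Each column (one feature) should be encoded instead: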
import numpy as np
from sklearn import preprocessing
X = [['reg', '03b', '03e', 'buy'],
['reg', '03b', '04e', 'sell'],
['pref', '02b', '03e', 'sell'],
['cur', '03b', '03e', 'buy'],
['val', '03b', '03e', 'buy'],
['reg', '03b', '03e', 'buy']]
X = np.array(X)
b = np.zeros(X.shape)
logreg = preprocessing.LabelEncoder()
i = 0
while i < X.shape[1]:
    logreg.fit(X[:, i])
    b[:, i] = logreg.transform(X[:, i])
    i += 1
b
array([[2., 1., 0., 0.],
[2., 1., 1., 1.],
[1., 0., 0., 1.],
[0., 1., 0., 0.],
[3., 1., 0., 0.],
[2., 1., 0., 0.]])
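A note on the choice of encoder: integer codes imply an ordering between categories (for example 'cur' < 'pref' < 'reg'), and a linear model such as logistic regression will treat that ordering as meaningful. A minimal sketch of one-hot encoding the same columns instead, assuming scikit-learn's OneHotEncoder; the y values below are made up for illustration because the question's y is truncated:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X = np.array([['reg', '03b', '03e', 'buy'],
              ['reg', '03b', '04e', 'sell'],
              ['pref', '02b', '03e', 'sell'],
              ['cur', '03b', '03e', 'buy'],
              ['val', '03b', '03e', 'buy'],
              ['reg', '03b', '03e', 'buy']])

# Toy labels, invented for this sketch (one label per row of X).
y = np.array([1, 0, 0, 0, 0, 0])

# One 0/1 column per distinct value in each feature column.
# handle_unknown="ignore" encodes unseen categories as all zeros at predict time.
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(X)  # sparse matrix

logreg = LogisticRegression()
logreg.fit(X_encoded, y)
predictions = logreg.predict(X_encoded)
```

Whether one-hot or integer codes work better depends on the data; the month columns, for instance, could arguably be parsed into numbers instead.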

Related

Conversion between R and Python(indexing issue?)

I'm trying to convert this R function to Python. However, I am running into issues with the results being incorrect.
The R function in question:
construct_omega <- function(k){
  E <- diag(2*k)
  omega <- matrix(0, ncol=2*k, nrow=2*k)
  for (i in 1:k){
    omega <- omega +
      E[,2*i-1] %*% t(E[,2*i]) -
      E[,2*i] %*% t(E[,2*i-1])
  }
  return(omega)
}
This is my current attempt at porting the function to Python:
def construct_omega(k=1):
    E = np.identity(2*k)
    omega = np.zeros((2*k, 2*k))
    for i in range(1, k):
        omega = omega + \
            E[:,2*i-1] * np.transpose(E[:,2*i]) - \
            E[:,2*i] * np.transpose(E[:,2*i-1])
    return omega
In R, the result matrix is this:
> construct_omega(2)
[,1] [,2] [,3] [,4]
[1,] 0 1 0 0
[2,] -1 0 0 0
[3,] 0 0 0 1
[4,] 0 0 -1 0
But in Python, the result is the 4x4 zero matrix.
Any help would be appreciated, thanks!

There are two issues here. First, E[:,j] is a 1-D array in numpy, so np.transpose does nothing to it and * multiplies element by element. The product of two different columns of the identity matrix is therefore a zero vector, not the matrix that R's %*% produces. The fix is to use np.outer. Second, indexing in Python starts at 0 and range(1, k) stops at k-1, so the loop bounds and the column indices need to be rewritten.
import numpy as np
def construct_omega(k=1):
    E = np.identity(2*k)
    omega = np.zeros((2*k, 2*k))
    for i in range(k):
        omega = omega + \
            np.outer(E[:,2*i], E[:,2*i+1].T) - \
            np.outer(E[:,2*i+1], E[:,2*i].T)
    return omega
construct_omega(1)
#array([[ 0., 1.],
# [ -1., 0.]])
and
construct_omega(2)
#array([[ 0., 1., 0., 0.],
# [-1., 0., 0., 0.],
# [ 0., 0., 0., 1.],
# [ 0., 0., -1., 0.]])
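To make the difference between * and np.outer concrete, here is a small sketch on two basis vectors. It also shows a loop-free variant built with np.kron; the name construct_omega_kron is hypothetical and not part of the original answer.

```python
import numpy as np

e0 = np.array([1.0, 0.0])
e1 = np.array([0.0, 1.0])

# np.transpose is a no-op on 1-D arrays, so this is an element-wise
# product and yields a 1-D vector of zeros.
elementwise = e0 * np.transpose(e1)

# np.outer yields the 2x2 matrix that R's  e0 %*% t(e1)  would give.
outer = np.outer(e0, e1)


def construct_omega_kron(k=1):
    """Block-diagonal matrix with k copies of [[0, 1], [-1, 0]]."""
    block = np.array([[0.0, 1.0], [-1.0, 0.0]])
    return np.kron(np.identity(k), block)


omega = construct_omega_kron(2)
```

The Kronecker product with the identity places one 2x2 block per diagonal position, which is the same matrix the loop accumulates.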

Finding determinant with torch.det doesn't return 0?

I'm trying to find the determinant of a matrix using torch.det. However, it seems like I'm either not doing it right or the function is not working properly (the results should be 0 rather than a small number).
a = torch.tensor([1.0, 1.0])
b = torch.tensor([3.0, 3.0])
c = torch.stack([a,b], dim = 1)
print(c)
torch.det(d)
>>>tensor([[1., 3.],
[1., 3.]])
tensor(1.2517e-06)
Another example:
a = torch.tensor([2, -1, 1]).float()
b = torch.tensor([3, -4, -2]).float()
c = torch.tensor([5, -10, -8]).float()
d = torch.stack([a,b,c], dim = 1)
print(d)
print(torch.det(d))
>>>
tensor([[ 2., 3., 5.],
[ -1., -4., -10.],
[ 1., -2., -8.]])
tensor(1.2517e-06)
Update 1:
I think I had a typo in the first example (I restarted everything and reran it):
import torch
a = torch.tensor([1.0, 1.0])
b = torch.tensor([3.0, 3.0])
c = torch.stack([a,b], dim = 1)
print(c)
torch.det(c)
>>> tensor([[1., 3.],
[1., 3.]])
tensor(0.)
Though, I believe the second example should also be 0
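The second matrix is indeed singular, so a tiny non-zero value is most likely floating-point round-off from the factorization used to compute the determinant. A minimal sketch of checking this, with NumPy standing in for torch (the exact round-off value will differ from the torch output) and an absolute tolerance of 1e-4 chosen only for illustration:

```python
import numpy as np

rows = [[2, 3, 5],
        [-1, -4, -10],
        [1, -2, -8]]


def det3_exact(m):
    """Cofactor expansion along the first row, exact for integer entries."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


exact = det3_exact(rows)  # integer arithmetic, no rounding

# Floating-point determinant: may come out as a tiny non-zero number.
approx = np.linalg.det(np.array(rows, dtype=np.float32))

# Compare against zero with a tolerance instead of testing for equality.
is_singular = bool(np.isclose(approx, 0.0, atol=1e-4))
```

A suitable tolerance depends on the size and scale of the matrix, so 1e-4 should not be read as a general recommendation.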

Encoding with OneHotEncoder

I'm trying to preprocess data with the OneHotEncoder of scikit-learn. Obviously, I'm doing something wrong. Here is my sample program:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
cat = ['ok', 'ko', 'maybe', 'maybe']
label_encoder = LabelEncoder()
label_encoder.fit(cat)
cat = label_encoder.transform(cat)
# returns [2 0 1 1], which seems good.
print(cat)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
res = ct.fit_transform([cat])
print(res)
Final result : [[1.0 0 1 1]]
Expected result : something like :
[
[ 1 0 0 ]
[ 0 0 1 ]
[ 0 1 0 ]
[ 0 1 0 ]
]
Can someone point out what I'm missing ?

You can consider using numpy and MultiLabelBinarizer.
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
cat = np.array([['ok', 'ko', 'maybe', 'maybe']])
m = MultiLabelBinarizer()
print(m.fit_transform(cat.T))
If you still want to stick with your solution, you just need to update it as follows:
# [cat] is still a single row, not a column
# res = ct.fit_transform([cat]) => remove this
# this should work
res = ct.fit_transform(np.array([cat]).T)
Out[2]:
array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 1., 0.]])
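The shape issue can be seen without scikit-learn. The sketch below, using only NumPy, shows that [cat] is one sample with four features while the encoder needs four samples with one feature; the one-hot step via np.unique and np.eye is a stand-in for illustration, not what OneHotEncoder does internally.

```python
import numpy as np

cat = ['ok', 'ko', 'maybe', 'maybe']

as_row = np.array([cat])                  # shape (1, 4): 1 sample, 4 features
as_column = np.array(cat).reshape(-1, 1)  # shape (4, 1): 4 samples, 1 feature

# Sorted distinct values and, for each entry, the index of its value.
categories, codes = np.unique(as_column[:, 0], return_inverse=True)

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(categories))[codes]
```

With the categories sorted as 'ko', 'maybe', 'ok', the rows of one_hot line up with the output shown above.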

Python: convert numpy array of signs to int and back

I'm trying to convert from a numpy array of signs (i.e., a numpy array whose entries are either 1. or -1.) to an integer and back through a binary representation. I have something that works, but it's not Pythonic, and I expect it'll be slow.
def sign2int(s):
    s[s==-1.] = 0.
    bstr = ''
    for i in range(len(s)):
        bstr = bstr + str(int(s[i]))
    return int(bstr, 2)
def int2sign(i, m):
    bstr = bin(i)[2:].zfill(m)
    s = []
    for d in bstr:
        s.append(float(d))
    s = np.array(s)
    s[s==0.] = -1.
    return s
Then
>>> m = 4
>>> s0 = np.array([1., -1., 1., 1.])
>>> i = sign2int(s0)
>>> print i
11
>>> s = int2sign(i, m)
>>> print s
[ 1. -1. 1. 1.]
I'm concerned about (1) the for loops in each and (2) having to build an intermediate representation as a string.
Ultimately, I will want something that works with a 2-d numpy array, too---e.g.,
>>> s = np.array([[1., -1., 1.], [1., 1., 1.]])
>>> print sign2int(s)
[5, 7]

For 1-D arrays you can use this one-line NumPy approach, using np.packbits:
>>> np.packbits(np.pad((s0+1).astype(bool).astype(int), (8-s0.size, 0), 'constant'))
array([11], dtype=uint8)
And for reversing:
>>> unpack = (np.unpackbits(np.array([11], dtype=np.uint8))[-4:]).astype(float)
>>> unpack[unpack==0] = -1
>>> unpack
array([ 1., -1., 1., 1.])
And for 2d array:
>>> x, y = s.shape
>>> np.packbits(np.pad((s+1).astype(bool).astype(int), (8-y, 0), 'constant')[-2:])
array([5, 7], dtype=uint8)
And for reversing:
>>> unpack = (np.unpackbits(np.array([5, 7], dtype='uint8'))).astype(float).reshape(x, 8)[:,-y:]
>>> unpack[unpack==0] = -1
>>> unpack
array([[ 1., -1., 1.],
[ 1., 1., 1.]])
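The 2-D case can also be written by padding only the column axis and packing along axis 1. This is a sketch under the assumption that each row holds at most 8 signs (one byte per row); the function names are hypothetical.

```python
import numpy as np


def sign2int_packbits(s):
    """Encode each row of a +1/-1 array as one uint8 (at most 8 columns)."""
    s = np.atleast_2d(s)
    bits = (s > 0).astype(np.uint8)
    pad = 8 - bits.shape[1]
    # Pad zeros on the left of every row so that each row fills one byte.
    padded = np.pad(bits, ((0, 0), (pad, 0)), mode="constant")
    return np.packbits(padded, axis=1)[:, 0]


def int2sign_packbits(codes, m):
    """Decode uint8 codes back into rows of m signs."""
    codes = np.asarray(codes, dtype=np.uint8)
    bits = np.unpackbits(codes[:, None], axis=1)[:, -m:]
    return bits.astype(float) * 2 - 1


s = np.array([[1., -1., 1.], [1., 1., 1.]])
codes = sign2int_packbits(s)
restored = int2sign_packbits(codes, s.shape[1])
```

Rows wider than 8 signs would need several bytes per row, which this sketch does not handle.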

I'll start with sign2int. Convert from a sign representation to binary:
>>> a
array([ 1., -1., 1., -1.])
>>> (a + 1) / 2
array([ 1., 0., 1., 0.])
>>>
Then you can simply create an array of powers of two, multiply it by the binary and sum.
>>> powers = np.arange(a.shape[-1])[::-1]
>>> np.power(2, powers)
array([8, 4, 2, 1])
>>> a = (a + 1) / 2
>>> powers = np.power(2, powers)
>>> a * powers
array([ 8., 0., 2., 0.])
>>> np.sum(a * powers)
10.0
>>>
Then make it operate on rows by adding axis information and rely on broadcasting.
def sign2int(a):
    # note: the out-argument calls below modify the caller's array a in place
    # powers of two
    powers = np.arange(a.shape[-1])[::-1]
    np.power(2, powers, powers)
    # sign to "binary" - add one and divide by two
    np.add(a, 1, a)
    np.divide(a, 2, a)
    # scale by powers of two and sum
    np.multiply(a, powers, a)
    return np.sum(a, axis = -1)
>>> b = np.array([a, a, a, a, a])
>>> sign2int(b)
array([ 11., 11., 11., 11., 11.])
>>>
I tried it on a 4 by 100 bit array and it seemed fast
>>> a = a.repeat(100)
>>> b = np.array([a, a, a, a, a])
>>> b
array([[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.]])
>>> sign2int(b)
array([ 2.58224988e+120, 2.58224988e+120, 2.58224988e+120,
2.58224988e+120, 2.58224988e+120])
>>>
I'll add the reverse if I can figure it out. The best I could do relies on some plain Python without any numpy vectorization magic, and I haven't figured out how to make it work with a sequence of ints other than to iterate over them and convert them one at a time, but the time still seems acceptable.
def foo(n):
    '''yields bits in increasing powers of two
    bit sequence from lsb --> msb
    '''
    while n > 0:
        n, r = divmod(n, 2)
        yield r
def int2sign(n):
    n = int(n)
    a = np.fromiter(foo(n), dtype = np.int8, count = n.bit_length())
    np.multiply(a, 2, a)
    np.subtract(a, 1, a)
    return a[::-1]
Works on 1324:
>>> bin(1324)
'0b10100101100'
>>> a = int2sign(1324)
>>> a
array([ 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, -1], dtype=int8)
Seems to work with 1.2e305:
>>> n = int(1.2e305)
>>> n.bit_length()
1014
>>> a = int2sign(n)
>>> a.shape
(1014,)
>>> s = bin(n)
>>> s = s[2:]
>>> all(2 * int(x) -1 == y for x, y in zip(s, a))
True
>>>
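For integers that fit in 64 bits, both directions can be vectorized over a whole sequence with bit operations. The following is a sketch with hypothetical function names; it assumes a fixed width m of at most 62 bits and does not cover the arbitrarily large Python ints handled above.

```python
import numpy as np


def sign2int_bits(s):
    """Rows of +1/-1 -> integers, without modifying the input array."""
    s = np.asarray(s)
    bits = (s > 0).astype(np.int64)
    # Weights 2**(m-1), ..., 2, 1 for the last axis.
    weights = 1 << np.arange(s.shape[-1] - 1, -1, -1)
    return bits @ weights


def int2sign_bits(n, m):
    """Integers -> rows of m signs, most significant bit first."""
    n = np.asarray(n, dtype=np.int64)
    shifts = np.arange(m - 1, -1, -1)
    # Broadcasting: shift every integer by every bit position, keep the low bit.
    bits = (n[..., None] >> shifts) & 1
    return bits * 2 - 1


s = np.array([[1., -1., 1.], [1., 1., 1.]])
codes = sign2int_bits(s)
restored = int2sign_bits(codes, 3)
single = int2sign_bits(11, 4)
```

Because the input is only read, the caller's array keeps its +1/-1 values after the call.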

Here are some vectorized versions of your functions:
def sign2int(s):
    return int(''.join(np.where(s == -1., 0, s).astype(int).astype(str)), 2)
def int2sign(i, m):
    tmp = np.array(list(bin(i)[2:].zfill(m)))
    return np.where(tmp == "0", "-1", tmp).astype(int)
s0 = np.array([1., -1., 1., 1.])
sign2int(s0)
# 11
int2sign(11, 5)
# array([-1, 1, -1, 1, 1])
To use your functions on 2-d arrays, you can use the map function (wrapped in list() on Python 3, where map returns an iterator):
s = np.array([[1., -1., 1.], [1., 1., 1.]])
list(map(sign2int, s))
# [5, 7]
list(map(lambda x: int2sign(x, 4), [5, 7]))
# [array([-1, 1, -1, 1]), array([-1, 1, 1, 1])]

After a bit of testing, the NumPy approach of @wwii that doesn't use strings seems to fit what I need best. For int2sign, I used a for-loop over the exponents with a standard algorithm for the conversion, which needs at most 64 iterations for 64-bit integers. Numpy's broadcasting happens across each integer very efficiently.
packbits and unpackbits are restricted to 8-bit integers; otherwise, I suspect that would've been the best (though I didn't try).
Here are the specific implementations I tested that follow the suggestions in the other answers (thanks to everyone!):
import time
import numpy as np

def _sign2int_str(s):
    return int(''.join(np.where(s == -1., 0, s).astype(int).astype(str)), 2)
def sign2int_str(s):
    # list() is needed on Python 3, where map returns an iterator
    return np.array(list(map(_sign2int_str, s)))
def _int2sign_str(i, m):
    tmp = np.array(list(bin(i)[2:])).astype(int)
    return np.pad(np.where(tmp == 0, -1, tmp), (m - len(tmp), 0), "constant", constant_values = -1)
def int2sign_str(i,m):
    return np.array(list(map(lambda x: _int2sign_str(x, m), i.astype(int).tolist()))).transpose()
def sign2int_np(s):
    p = np.arange(s.shape[-1])[::-1]
    # map -1/+1 to 0/1, then weight each bit by its power of two
    # (np.power(s + 1, p) would count a -1 in the last position as 0**0 == 1)
    bits = (s + 1) / 2
    return np.sum(bits * np.power(2, p), axis = -1).astype(int)
def int2sign_np(i,m):
    N = i.shape[-1]
    S = np.zeros((m, N))
    for k in range(m):
        b = 2 ** (m - 1 - k)
        # floor division: np.divide is true division on Python 3
        S[k,:] = np.floor_divide(i.astype(int), b).astype(float)
        i = np.mod(i, b)
    S[S==0.] = -1.
    return S
And here is my test:
X = np.sign(np.random.normal(size=(5000, 20)))
N = 100
t = time.time()
for i in range(N):
    S = sign2int_np(X)
print('sign2int_np: \t{:10.8f} sec'.format((time.time() - t)/N))
t = time.time()
for i in range(N):
    S = sign2int_str(X)
print('sign2int_str: \t{:10.8f} sec'.format((time.time() - t)/N))
m = 20
S = np.random.randint(0, high=np.power(2,m), size=(5000,))
t = time.time()
for i in range(N):
    X = int2sign_np(S, m)
print('int2sign_np: \t{:10.8f} sec'.format((time.time() - t)/N))
t = time.time()
for i in range(N):
    X = int2sign_str(S, m)
print('int2sign_str: \t{:10.8f} sec'.format((time.time() - t)/N))
This produced the following results:
sign2int_np: 0.00165325 sec
sign2int_str: 0.04121902 sec
int2sign_np: 0.00318024 sec
int2sign_str: 0.24846984 sec

I think numpy.packbits is worth another look. Given a real-valued sign array a, you can use numpy.packbits(a > 0). Decompression is done by numpy.unpackbits. This implicitly flattens multi-dimensional arrays so you'll need to reshape after unpackbits if you have a multi-dimensional array.
Note that you can combine bit packing with conventional compression (e.g., zlib or lzma). If there is a pattern or bias to your data, you may get a useful compression factor, but for unbiased random data, you'll typically see a moderate size increase.
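A minimal sketch of that round trip for an array of any shape. The helper names are hypothetical; the original shape has to be kept alongside the packed bytes because packbits flattens the input and pads the last byte with zeros.

```python
import numpy as np


def pack_signs(a):
    """Pack a +1/-1 array into bytes; also return the shape needed to undo it."""
    return np.packbits(a > 0), a.shape


def unpack_signs(packed, shape):
    """Inverse of pack_signs: drop the padding bits and restore the shape."""
    n = int(np.prod(shape))
    bits = np.unpackbits(packed)[:n]
    return (bits.astype(np.int8) * 2 - 1).reshape(shape)


signs = np.array([[1., -1., 1., 1., -1.],
                  [-1., -1., 1., -1., 1.],
                  [1., 1., 1., -1., -1.]])
packed, shape = pack_signs(signs)
restored = unpack_signs(packed, shape)
```

Here 15 signs fit into 2 bytes, compared with 120 bytes for the float64 original.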

Save one-hot-encoded features into Pandas DataFrame the fastest way

I have a Pandas DataFrame with all my features and labels. One of my features is categorical and needs to be one-hot encoded.
The feature is an integer and can only take values from 0 to 4.
To save those arrays back in my DataFrame I use the following code:
# enc is my OneHotEncoder object
df['mycol'] = df['mycol'].map(lambda x: enc.transform(x).toarray())
My DataFrame has more than 1 million rows, so the above code takes a while. Is there a faster way to assign the arrays to the DataFrame cells? Because I have just 5 categories, I don't need to call the transform() function 1 million times.
I already tried something like
num_categories = 5
i = 0
while (i<num_categories):
    df.loc[df['mycol'] == i, 'mycol'] = enc.transform(i).toarray()
    i += 1
which yields this error:
ValueError: Must have equal len keys and value when setting with an ndarray

You can use pd.get_dummies:
>>> s
0 a
1 b
2 c
3 a
dtype: object
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
Alternatively:
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> a = np.array([1, 1, 3, 2, 2]).reshape(-1, 1)
>>> a
array([[1],
[1],
[3],
[2],
[2]])
>>> one_hot = enc.fit_transform(a)
>>> one_hot.toarray()
array([[ 1., 0., 0.],
[ 1., 0., 0.],
[ 0., 0., 1.],
[ 0., 1., 0.],
[ 0., 1., 0.]])
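Applied to the question's setting, the get_dummies route could look like the sketch below. The column name mycol and the 0 to 4 value range come from the question; the label column and the sample values are made up, and storing one indicator column per category (instead of an array per cell) is a design choice of this sketch.

```python
import pandas as pd

# Toy frame standing in for the 1-million-row DataFrame.
df = pd.DataFrame({"mycol": [0, 3, 4, 0], "label": [1, 0, 0, 1]})

# Declaring all five categories up front gives a column for every value,
# even for values that do not occur in this particular sample.
categories = pd.CategoricalDtype(categories=[0, 1, 2, 3, 4])
dummies = pd.get_dummies(df["mycol"].astype(categories), prefix="mycol", dtype=int)

# Replace the original column with the five indicator columns in one step.
encoded = df.drop(columns="mycol").join(dummies)
```

This encodes the whole column in a single vectorized call, so nothing is invoked once per row.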
