sklearn for LASSO (BIC tuning) - python

We encountered a problem when using the LASSO-related functions in sklearn. Since LASSO with BIC tuning only selects the value of alpha, the results of LASSO with BIC (1) should be equivalent to those of LASSO with the fixed optimal alpha (2).
(1) linear_model.LassoLarsIC
(2) linear_model.Lasso
First, consider a simple DGP (data-generating process) setting:
import numpy as np
from sklearn import linear_model
################## DGP ##################
np.random.seed(10)
T = 200 # sample size
p = 100 # number of regressors
X = np.random.normal(size = (T, p))
u = np.random.normal(size = T)
beta = np.hstack((np.array([5, 0, 3, 0, 1, 0, 0, 0, 0, 0]), np.zeros(p-10)))
y = np.dot(X, beta) + u
Then we fit the LASSO with BIC tuning, linear_model.LassoLarsIC:
# LASSO with BIC
lasso = linear_model.LassoLarsIC(criterion='bic')
lasso.fit(X,y)
print("lasso coef = \n {}".format(lasso.coef_))
print("lasso optimal alpha = {}".format(lasso.alpha_))
lasso coef =
[ 4.81934044 0. 2.87574831 0. 0.90031582 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.01705965 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
-0.07789506 0. 0.05817856 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. ]
lasso optimal alpha = 0.010764484244859006
Then we pass this optimal alpha to the plain LASSO, linear_model.Lasso:
# LASSO with fixed alpha
clf = linear_model.Lasso(alpha=lasso.alpha_)
clf.fit(X,y)
print("lasso coef = \n {}".format(clf.coef_))
lasso coef =
[ 4.93513468e+00 5.42491624e-02 3.00412571e+00 -3.83394653e-02
9.87262697e-01 5.21693412e-03 -2.89977454e-02 -1.40952930e-01
5.18653123e-02 -7.66271662e-02 -1.99074552e-02 2.72228580e-02
-1.01217167e-01 -4.69445223e-02 1.74378470e-01 2.52655725e-02
1.84902632e-02 -7.11030674e-02 -4.15940817e-03 1.98229236e-02
-8.81779536e-02 -3.59094431e-02 5.53212537e-03 9.23031418e-02
1.21577471e-01 -4.73932893e-03 5.15459727e-02 4.17136419e-02
4.49561794e-02 -4.74874460e-03 0.00000000e+00 -3.56968194e-02
-4.43094631e-02 0.00000000e+00 1.00390051e-03 7.17980301e-02
-7.39058574e-02 1.73139031e-02 7.88996602e-02 1.04325618e-01
-4.10356303e-02 5.94564069e-02 0.00000000e+00 9.28354383e-02
0.00000000e+00 4.57453873e-02 0.00000000e+00 0.00000000e+00
-1.94113178e-02 1.97056365e-02 -1.17381604e-01 5.13943798e-02
2.11245596e-01 4.24124220e-02 1.16573094e-01 1.19551223e-02
-0.00000000e+00 -0.00000000e+00 -8.35210244e-02 -8.29230887e-02
-3.16409003e-02 8.43274240e-02 -2.90949577e-02 -0.00000000e+00
1.24697858e-01 -3.07120380e-02 -4.34558350e-02 -0.00000000e+00
1.30491858e-01 -2.04573808e-02 6.72141775e-02 -6.85563204e-02
5.64781612e-02 -7.43380132e-02 1.88610065e-01 -5.53155313e-04
0.00000000e+00 2.43191722e-02 9.10973250e-02 -4.49945551e-02
3.36006276e-02 -0.00000000e+00 -3.85862475e-02 -9.63711465e-02
-2.07015665e-01 8.67164869e-02 1.30776709e-01 -0.00000000e+00
5.42630086e-02 -1.44763258e-01 -0.00000000e+00 -3.29485283e-02
-2.35245212e-02 -6.19975427e-02 -8.83892134e-03 -1.60523703e-01
9.63008989e-02 -1.06953313e-01 4.60206741e-02 6.02880434e-02]
-0.06321829752708413
The two sets of coefficients are different.
Why does this happen?

So the main difference I could find off the bat is the max_iter parameter, which is at 1000 with the Lasso model and at 500 with the LassoLarsIC model.
Other hyperparameters such as tol and selection are not adjustable in the LassoLarsIC implementation.
There might be more nuanced differences in the exact implementation of the two models though.
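One way to narrow this down is to check whether the two solvers disagree when they are given exactly the same objective. The sketch below is only an illustration, not a diagnosis of the run above: it uses the function linear_model.lars_path (LARS in lasso mode, stopped at a fixed alpha) and linear_model.Lasso without an intercept, on data like the DGP in the question. The value alpha = 0.05 and the tight tol / large max_iter are arbitrary choices of mine. If the coefficients agree here, the gap seen in the question is more likely due to differing estimator settings than to the algorithms. As far as I know, older scikit-learn releases defaulted to normalize=True for LassoLarsIC and normalize=False for Lasso, which would make the same alpha correspond to a different effective penalty.

```python
import numpy as np
from sklearn import linear_model

# Same kind of DGP as in the question (a local RandomState instead of the global seed)
rng = np.random.RandomState(10)
T, p = 200, 100
X = rng.normal(size=(T, p))
beta = np.zeros(p)
beta[[0, 2, 4]] = [5, 3, 1]
y = X @ beta + rng.normal(size=T)

alpha = 0.05  # arbitrary penalty level chosen for this illustration

# LARS in lasso mode, stopping the path at alpha; the last column is the solution at alpha
alphas, active, coefs = linear_model.lars_path(X, y, method="lasso", alpha_min=alpha)
coef_lars = coefs[:, -1]

# Coordinate descent on the same objective (no intercept, tight tolerance)
cd = linear_model.Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100000)
cd.fit(X, y)

max_diff = np.max(np.abs(coef_lars - cd.coef_))
print("largest coefficient difference:", max_diff)
```

If max_diff is tiny, the next thing to compare is the constructor arguments of the two estimators (for example via get_params()) in the installed scikit-learn version.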

Related

Sparse vectors for training data

I have training data like this:
x_train = np.random.randint(100, size=(1000, 25))
where each row is a sample and thus we have 1000 samples.
Now I need the training data to have at most 3 non-zero elements out of 25 in each sample/row.
Can you please suggest how I can implement that? Thanks!
I am assuming that you want to turn a majority of your data into zeros, except that 0 to 3 non-zero elements are retained (randomly) for each row. If this is the case, a possible way to do this is as follows.
Code
import numpy as np
max_ = 3
nrows = 1000
ncols = 25
np.random.seed(7)
X = np.zeros((nrows,ncols))
data = np.random.randint(100, size=(nrows, ncols))
# maximum number of non-zeros to generate for each row
vmax = np.random.randint(low=0, high=4, size=(nrows,))
for i in range(nrows):
    if vmax[i] > 0:
        # column indices for the non-zeros (duplicates are possible, so a row may get fewer)
        col = np.random.randint(low=0, high=ncols, size=(1, vmax[i]))
        # set the non-zero elements
        X[i][col] = data[i][col]
print(X)
Output
[[ 0. 68. 25. ... 0. 0. 0.]
[ 0. 0. 0. ... 0. 0. 0.]
[ 0. 0. 0. ... 0. 0. 0.]
...
[ 0. 0. 0. ... 0. 0. 0.]
[88. 0. 0. ... 0. 0. 0.]
[ 0. 0. 0. ... 0. 0. 0.]]
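A loop-free variant is sketched below, under a couple of assumptions of mine: values are drawn from 1 to 99 so that every kept entry really is non-zero, and the names rng, keep, order and X_sparse are just illustrative. Each row of order is a random permutation of the column indices, so exactly keep[i] of its entries are smaller than keep[i]; that gives distinct random columns per row without sampling duplicates.

```python
import numpy as np

rng = np.random.default_rng(7)
nrows, ncols, max_nonzero = 1000, 25, 3

# Values in 1..99 so that a kept entry is never zero by accident
data = rng.integers(1, 100, size=(nrows, ncols))

# How many entries to keep in each row: 0, 1, 2 or 3
keep = rng.integers(0, max_nonzero + 1, size=nrows)

# argsort of uniform noise gives an independent random permutation in every row
order = rng.random((nrows, ncols)).argsort(axis=1)

# Exactly keep[i] entries of row i of `order` are < keep[i]
mask = order < keep[:, None]
X_sparse = np.where(mask, data, 0)

print((X_sparse != 0).sum(axis=1).max())
```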

How to find minimum value in each row while keeping array dimensions same using numpy?

I've the following array:
np.array([[0.07704314, 0.46752589, 0.39533099, 0.35752864],
[0.45813299, 0.02914078, 0.65307364, 0.58732429],
[0.32757561, 0.32946822, 0.59821108, 0.45585825],
[0.49054429, 0.68553148, 0.26657932, 0.38495586]])
I want to find the minimum value in each row of the array. How can I achieve this?
Expected answer:
[[0.07704314 0. 0. 0. ]
[0. 0.02914078 0. 0. ]
[0.32757561 0 0. 0. ]
[0. 0. 0.26657932 0. ]]
You can use np.where like so:
np.where(a.argmin(1)[:,None]==np.arange(a.shape[1]), a, 0)
Or (more lines but potentially more efficient):
out = np.zeros_like(a)
idx = a.argmin(1)[:, None]
np.put_along_axis(out, idx, np.take_along_axis(a, idx, 1), 1)
IIUC, first find the minimum value of each row, then build a boolean mask that is True wherever the original array equals its row minimum, and multiply the array by that mask to get the result:
np.multiply(a,a==np.min(a,1)[:,None])
Out[225]:
array([[0.07704314, 0. , 0. , 0. ],
[0. , 0.02914078, 0. , 0. ],
[0.32757561, 0. , 0. , 0. ],
[0. , 0. , 0.26657932, 0. ]])
np.amin(a, axis=1), where a is your NumPy array, returns the minimum of each row (as a 1-D array, without keeping the original shape).
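For completeness, here is a self-contained sketch of the same idea using plain integer-array indexing, run on the array from the question. The names rows, idx and out are mine. One design difference worth knowing: the argmin-based versions keep only the first minimum in a row, while the a == min mask keeps every tied minimum.

```python
import numpy as np

a = np.array([[0.07704314, 0.46752589, 0.39533099, 0.35752864],
              [0.45813299, 0.02914078, 0.65307364, 0.58732429],
              [0.32757561, 0.32946822, 0.59821108, 0.45585825],
              [0.49054429, 0.68553148, 0.26657932, 0.38495586]])

rows = np.arange(a.shape[0])   # row index of every row
idx = a.argmin(axis=1)         # column of the (first) minimum in each row

out = np.zeros_like(a)
out[rows, idx] = a[rows, idx]  # copy only the row minima, leave the rest at zero
print(out)
```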

Error when Coding Perceptron: ValueError: shapes (124,124) and (1,10) not aligned: 124 (dim 1) != 1 (dim 0)

I'm trying to code a Multi-Layer Perceptron, but it seems I get it wrong when I'm trying to import data from csv file using genfromtxt function from numpy library.
import numpy as np
from numpy import genfromtxt
dfX = genfromtxt('C:/Users/m15x/Desktop/UFABC/PDPD/inputX(editado_bits).csv', delimiter=',')
dfy = genfromtxt('C:/Users/m15x/Desktop/UFABC/PDPD/inputY(editado_bits).csv', delimiter=',')
X = dfX
y = dfy
print(X)
print(y)
# Whole Class with additions:
class Neural_Network(object):
    def __init__(self):
        # Define Hyperparameters
        self.inputLayerSize = 26
        self.outputLayerSize = 1
        self.hiddenLayerSize = 10
        # Weights (parameters)
        self.W1 = np.random.randn(self.inputLayerSize, self.hiddenLayerSize)
        self.W2 = np.random.randn(self.hiddenLayerSize, self.outputLayerSize)
And my X (124,26) and y (124,1) are the following arrays respectively:
[[ 1. 0. 1. ..., 1. 0. 0.]
[ 0. 1. 1. ..., 1. 0. 0.]
[ 0. 1. 1. ..., 1. 0. 0.]
...,
[ 0. 1. 1. ..., 1. 0. 0.]
[ 1. 0. 1. ..., 1. 0. 0.]
[ 1. 0. 1. ..., 1. 0. 0.]]
[ 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.
0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0.
0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1.
1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0.
1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0.]
And I get notified with:
Traceback (most recent call last):
  File "C:/Users/m15x/PycharmProjects/Deep Learning/MLP_tinnitus_1.py", line 141, in <module>
    T.train(X,y)
  File "C:/Users/m15x/PycharmProjects/Deep Learning/MLP_tinnitus_1.py", line 134, in train
    args=(X, y), options=options, callback=self.callbackF)
  File "C:\Users\m15x\Anaconda3\lib\site-packages\scipy\optimize\_minimize.py", line 444, in minimize
    return _minimize_bfgs(fun, x0, args, jac, callback, **options)
  File "C:\Users\m15x\Anaconda3\lib\site-packages\scipy\optimize\optimize.py", line 913, in _minimize_bfgs
    gfk = myfprime(x0)
  File "C:\Users\m15x\Anaconda3\lib\site-packages\scipy\optimize\optimize.py", line 292, in function_wrapper
    return function(*(wrapper_args + args))
  File "C:\Users\m15x\Anaconda3\lib\site-packages\scipy\optimize\optimize.py", line 71, in derivative
    self(x, *args)
  File "C:\Users\m15x\Anaconda3\lib\site-packages\scipy\optimize\optimize.py", line 63, in __call__
    fg = self.fun(x, *args)
  File "C:/Users/m15x/PycharmProjects/Deep Learning/MLP_tinnitus_1.py", line 119, in costFunctionWrapper
    grad = self.N.computeGradients(X, y)
  File "C:/Users/m15x/PycharmProjects/Deep Learning/MLP_tinnitus_1.py", line 76, in computeGradients
    dJdW1, dJdW2 = self.costFunctionPrime(X, y)
  File "C:/Users/m15x/PycharmProjects/Deep Learning/MLP_tinnitus_1.py", line 56, in costFunctionPrime
    delta2 = np.dot(delta3, self.W2.T) * self.sigmoidPrime(self.z2)
ValueError: shapes (124,124) and (1,10) not aligned: 124 (dim 1) != 1 (dim 0)
The error occurs when I try to train the network on the data above:
def train(self, X, y):
    # Make an internal variable for the callback function:
    self.X = X
    self.y = y
    # Make empty list to store costs:
    self.J = []
    params0 = self.N.getParams()
    options = {'maxiter': 10000, 'disp': True}
    _res = optimize.minimize(self.costFunctionWrapper, params0, jac=True, method='BFGS',
                             args=(X, y), options=options, callback=self.callbackF)
    self.N.setParams(_res.x)
    self.optimizationResults = _res
I know the shapes of my X and y arrays don't fit together, but I don't know which function I can use to reshape the data for y, which is read from a (124,1)-shape csv file ('C:/Users/m15x/Desktop/UFABC/PDPD/inputY(editado_bits).csv'), while X is read from a (124,26)-shape csv file ('C:/Users/m15x/Desktop/UFABC/PDPD/inputX(editado_bits).csv').
The data imported with genfromtxt does not seem to have the appropriate shape.
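A likely source of the (124,124) shape, sketched under assumptions: the printed y above is a 1-D array of 124 values, and the network output is presumably a (124,1) column (the forward pass is not shown, so y_hat below is only a stand-in filled with zeros). Subtracting a (124,) array from a (124,1) array broadcasts to (124,124), which would match the first shape in the error message. Reshaping y into a column avoids this.

```python
import numpy as np

n_samples = 124
y_flat = np.zeros(n_samples)       # shape (124,), like the 1-D y printed above
y_hat = np.zeros((n_samples, 1))   # shape (124, 1), hypothetical stand-in for the network output

# Broadcasting (124,) against (124, 1) silently produces a (124, 124) matrix
bad_delta = y_flat - y_hat

# Turning y into a column vector keeps the difference at (124, 1)
y_col = y_flat.reshape(-1, 1)
good_delta = y_col - y_hat

print(bad_delta.shape, good_delta.shape)
```

In the question's code this would correspond to something like y = dfy.reshape(-1, 1) right after loading, assuming the rest of the class expects a column vector.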

create 3D binary image

I have a 2D array, a, comprising a set of 100 x,y,z coordinates:
[[ 0.81 0.23 0.52]
[ 0.63 0.45 0.13]
...
[ 0.51 0.41 0.65]]
I would like to create a 3D binary image, b, with 101 pixels in each of the x,y,z dimensions, of coordinates ranging between 0.00 and 1.00.
Pixels at locations defined by a should take on a value of 1, all other pixels should have a value of 0.
I can create an array of zeros of the right shape with b = np.zeros((101,101,101)), but how do I assign coordinate and slice into it to create the ones using a?
First, start off by safely rounding your floats to ints. In context, see this question.
a_indices = np.rint(a * 100).astype(int)
Next, assign those indices in b to 1. But be careful to index with a tuple of per-axis index arrays instead of the 2D array itself, or else the whole array is treated as a single index array for the first axis. Newer NumPy releases treat a plain list of arrays the same way, so use tuple rather than list. It seems as though performance of this method is comparable to that of alternatives (Thanks @Divakar! :-)
b[tuple(a_indices.T)] = 1
I created a small example with size 10 instead of 100, and 2 dimensions instead of 3, to illustrate:
>>> a = np.array([[0.8, 0.2], [0.6, 0.4], [0.5, 0.6]])
>>> a_indices = np.rint(a * 10).astype(int)
>>> b = np.zeros((10, 10))
>>> b[tuple(a_indices.T)] = 1
>>> print(b)
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
You could do something like this -
# Get the XYZ indices
idx = np.round(100 * a).astype(int)
# Initialize o/p array
b = np.zeros((101,101,101))
# Assign into o/p array based on linear index equivalents from indices array
np.put(b,np.ravel_multi_index(idx.T,b.shape),1)
Runtime on the assignment part -
Let's use a bigger grid for timing purposes.
In [82]: # Setup input and get indices array
...: a = np.random.randint(0,401,(100000,3))/400.0
...: idx = np.round(400 * a).astype(int)
...:
In [83]: b = np.zeros((401,401,401))
In [84]: %timeit b[list(idx.T)] = 1 ##Praveen soln
The slowest run took 42.16 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 6.28 ms per loop
In [85]: b = np.zeros((401,401,401))
In [86]: %timeit np.put(b,np.ravel_multi_index(idx.T,b.shape),1) # From this post
The slowest run took 45.34 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 5.71 ms per loop
In [87]: b = np.zeros((401,401,401))
In [88]: %timeit b[idx[:,0],idx[:,1],idx[:,2]] = 1 #Subscripted indexing
The slowest run took 40.48 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 6.38 ms per loop
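Putting the pieces together for the 3D case, here is a minimal sketch using the three sample coordinates shown in the question (the remaining rows are elided there, so only these three are used). It indexes with a tuple of per-axis index arrays and uses an integer dtype for the binary image, which is my choice rather than something the question requires.

```python
import numpy as np

# The three coordinate rows visible in the question
a = np.array([[0.81, 0.23, 0.52],
              [0.63, 0.45, 0.13],
              [0.51, 0.41, 0.65]])

# Round to the nearest grid point on a 101-point axis covering 0.00..1.00
idx = np.rint(a * 100).astype(int)

b = np.zeros((101, 101, 101), dtype=np.uint8)
b[tuple(idx.T)] = 1   # one index array per axis: (x indices, y indices, z indices)

print(b.sum(), b[81, 23, 52])
```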

Python creating matrix using if condition on indices : incorrect result

I have the following code where I have been trying to create a tridiagonal matrix x using if-conditions.
#!/usr/bin/env python
# import useful modules
import numpy as np
N=5
x=np.identity(N)
#x=np.zeros((N,N))
print x
# Construct NxN matrix
for i in range(N):
    for j in range(N):
        if i-j==1:
            x[i][j]=1
        elif j-1==1:
            x[i][j]=-1
        else:
            x[i][j]=0
        print "i= ",i," j= ",j
print x
I desire to get
[[ 0. -1. 0. 0. 0.]
[ 1. 0. -1. 0. 0.]
[ 0. 1. 0. -1 0.]
[ 0. 0. 1. 0. -1.]
[ 0. 0. 0. 1. 0.]]
However, I obtain
[[ 0. 0. -1. 0. 0.]
[ 1. 0. -1. 0. 0.]
[ 0. 1. -1. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. -1. 1. 0.]]
What's going wrong?
Bonus question : Can I forcefully index from 1 to 5 instead of 0 to 4 in this example, or Python never allows that?
elif j-1==1: should be elif j-i==1:.
And no, lists/arrays etc. are always indexed from 0.
As for the bonus question, the first element of a sequence in Python always has index 0. However, if for some particular reason (for example to prevent off-by-one errors) you wish to count the elements of a sequence from a value other than 0, you could use the built-in function enumerate() and set the value of the optional parameter start to fit your needs:
>>> seq = ['a', 'b', 'c']
>>> for count, item in enumerate(seq, start=1):
...     print(count, item)
...
1 a
2 b
3 c
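As a side note, the desired matrix can also be built without explicit loops. The sketch below assumes the target is exactly the matrix shown in the question (1 where i - j == 1, -1 where j - i == 1) and uses np.eye with a diagonal offset k.

```python
import numpy as np

N = 5
# np.eye(N, k=-1) has ones on the first subdiagonal (i - j == 1),
# np.eye(N, k=1) has ones on the first superdiagonal (j - i == 1)
x = np.eye(N, k=-1) - np.eye(N, k=1)
print(x)
```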
