Sparse vectors for training data

I have training data like this:
x_train = np.random.randint(100, size=(1000, 25))
where each row is a sample, so there are 1000 samples.
Now I need the training data to have at most 3 non-zero elements out of 25 in each sample/row.
Can you please suggest how I can implement that? Thanks!

I am assuming that you want to turn a majority of your data into zeros, except that 0 to 3 non-zero elements are retained (randomly) for each row. If this is the case, a possible way to do this is as follows.
Code
import numpy as np
max_ = 3
nrows = 1000
ncols = 25
np.random.seed(7)
X = np.zeros((nrows,ncols))
data = np.random.randint(100, size=(nrows, ncols))
# maximum number of non-zeros to be generated for each row (0 to max_)
vmax = np.random.randint(low=0, high=max_+1, size=(nrows,))
for i in range(nrows):
    if vmax[i]>0:
        # column indices for the non-zeros (indices may repeat, so a row can end up with fewer)
        col = np.random.randint(low=0, high=ncols, size=(1,vmax[i]))
        # set non-zero elements
        X[i][col] = data[i][col]
print(X)
Output
[[ 0. 68. 25. ... 0. 0. 0.]
[ 0. 0. 0. ... 0. 0. 0.]
[ 0. 0. 0. ... 0. 0. 0.]
...
[ 0. 0. 0. ... 0. 0. 0.]
[88. 0. 0. ... 0. 0. 0.]
[ 0. 0. 0. ... 0. 0. 0.]]
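
For larger data sets the per-row loop can be avoided. The following is only a sketch of one vectorized alternative, not the answer's method: the helper name sparsify_rows and its parameters are made up for illustration. It gives each row a random permutation of column ranks and keeps the columns whose rank falls below that row's quota, so the chosen columns are always distinct.

```python
import numpy as np

def sparsify_rows(data, max_nonzero=3, seed=0):
    """Return a copy of `data` with at most `max_nonzero` entries kept per row.

    Hypothetical helper for illustration; not part of NumPy.
    """
    rng = np.random.default_rng(seed)
    nrows, ncols = data.shape
    # Each row of `ranks` is a random permutation of 0..ncols-1.
    ranks = rng.random((nrows, ncols)).argsort(axis=1)
    # How many entries to keep in each row: an integer in 0..max_nonzero.
    keep = rng.integers(0, max_nonzero + 1, size=nrows)
    # Keep the columns whose random rank is below the row's quota.
    mask = ranks < keep[:, None]
    return np.where(mask, data, 0)

x_train = np.random.default_rng(7).integers(0, 100, size=(1000, 25))
x_sparse = sparsify_rows(x_train)
# Largest number of non-zeros found in any row (never more than 3)
print(np.count_nonzero(x_sparse, axis=1).max())
```

Note that a kept entry can itself be 0 in the source data, so a row may contain fewer non-zeros than its quota.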

Related

sklearn for LASSO (BIC tuning)

We encountered a problem when using the LASSO-related functions in sklearn. Since LASSO with BIC tuning only changes alpha, the results of LASSO with BIC (1) should be equivalent to those of LASSO with the fixed optimal alpha (2).
(1) linear_model.LassoLarsIC
(2) linear_model.Lasso
First, consider a simple DGP (data-generating process) setting:
################## DGP ##################
import numpy as np
from sklearn import linear_model
np.random.seed(10)
T = 200 # sample size
p = 100 # number of regressors
X = np.random.normal(size = (T, p))
u = np.random.normal(size = T)
beta = np.hstack((np.array([5, 0, 3, 0, 1, 0, 0, 0, 0, 0]), np.zeros(p-10)))
y = np.dot(X, beta) + u
Then we fit LASSO with BIC using linear_model.LassoLarsIC:
# LASSO with BIC
lasso = linear_model.LassoLarsIC(criterion='bic')
lasso.fit(X,y)
print("lasso coef = \n {}".format(lasso.coef_))
print("lasso optimal alpha = {}".format(lasso.alpha_))
lasso coef =
[ 4.81934044 0. 2.87574831 0. 0.90031582 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.01705965 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
-0.07789506 0. 0.05817856 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. ]
lasso optimal alpha = 0.010764484244859006
Then we pass that optimal alpha to plain LASSO using linear_model.Lasso:
# LASSO with fixed alpha
clf = linear_model.Lasso(alpha=lasso.alpha_)
clf.fit(X,y)
print("lasso coef = \n {}".format(clf.coef_))
lasso coef =
[ 4.93513468e+00 5.42491624e-02 3.00412571e+00 -3.83394653e-02
9.87262697e-01 5.21693412e-03 -2.89977454e-02 -1.40952930e-01
5.18653123e-02 -7.66271662e-02 -1.99074552e-02 2.72228580e-02
-1.01217167e-01 -4.69445223e-02 1.74378470e-01 2.52655725e-02
1.84902632e-02 -7.11030674e-02 -4.15940817e-03 1.98229236e-02
-8.81779536e-02 -3.59094431e-02 5.53212537e-03 9.23031418e-02
1.21577471e-01 -4.73932893e-03 5.15459727e-02 4.17136419e-02
4.49561794e-02 -4.74874460e-03 0.00000000e+00 -3.56968194e-02
-4.43094631e-02 0.00000000e+00 1.00390051e-03 7.17980301e-02
-7.39058574e-02 1.73139031e-02 7.88996602e-02 1.04325618e-01
-4.10356303e-02 5.94564069e-02 0.00000000e+00 9.28354383e-02
0.00000000e+00 4.57453873e-02 0.00000000e+00 0.00000000e+00
-1.94113178e-02 1.97056365e-02 -1.17381604e-01 5.13943798e-02
2.11245596e-01 4.24124220e-02 1.16573094e-01 1.19551223e-02
-0.00000000e+00 -0.00000000e+00 -8.35210244e-02 -8.29230887e-02
-3.16409003e-02 8.43274240e-02 -2.90949577e-02 -0.00000000e+00
1.24697858e-01 -3.07120380e-02 -4.34558350e-02 -0.00000000e+00
1.30491858e-01 -2.04573808e-02 6.72141775e-02 -6.85563204e-02
5.64781612e-02 -7.43380132e-02 1.88610065e-01 -5.53155313e-04
0.00000000e+00 2.43191722e-02 9.10973250e-02 -4.49945551e-02
3.36006276e-02 -0.00000000e+00 -3.85862475e-02 -9.63711465e-02
-2.07015665e-01 8.67164869e-02 1.30776709e-01 -0.00000000e+00
5.42630086e-02 -1.44763258e-01 -0.00000000e+00 -3.29485283e-02
-2.35245212e-02 -6.19975427e-02 -8.83892134e-03 -1.60523703e-01
9.63008989e-02 -1.06953313e-01 4.60206741e-02 6.02880434e-02]
-0.06321829752708413
The two sets of coefficients are different.
Why does this happen?

The main difference I could find at first glance is the default max_iter parameter, which is 1000 for the Lasso model and 500 for the LassoLarsIC model.
Other hyperparameters such as tol and selection are not adjustable in the LassoLarsIC implementation.
There might be more nuanced differences in the exact implementation of the two models, though.

Is there a way to insert multiple elements to different locations in a ndarray all at once?

I'm using numpy's ndarray, and I'm wondering whether there is a way to insert multiple elements at different locations all at once.
For example, I have an image, and I want to pad the image with 0s. This is what I currently have:
def zero_padding(self):
    padded = self.copy()
    padded.img = np.insert(self.img, 0, 0, axis = 0)
    padded.img = np.insert(padded.img, padded.img.shape[0], 0, axis = 0)
    padded.img = np.insert(padded.img, 0, 0, axis = 1)
    padded.img = np.insert(padded.img, padded.img.shape[1], 0, axis = 1)
    return padded
where padded is an instance of the image.

Sure, you can use NumPy's fancy indexing technique as follows:
import numpy as np
if __name__=='__main__':
    A = np.zeros((5, 5))
    # sets A[1, 0] and A[2, 3] in a single assignment
    A[[1, 2], [0, 3]] = 1
    print(A)
Output:
[[0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
Cheers
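
For the specific zero-padding case in the question, np.pad adds the whole border in one call. This is a minimal sketch on a small stand-in array (img here is made-up data, not the asker's image class):

```python
import numpy as np

# Small stand-in for an image; any 2-D array works the same way.
img = np.arange(1, 7).reshape(2, 3)

# Add a one-pixel border of zeros on every side in a single call.
padded = np.pad(img, pad_width=1, mode="constant", constant_values=0)
print(padded)
```

Inside a method like zero_padding, the four np.insert calls could then be replaced by one assignment of the np.pad result to padded.img.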

How to find minimum value in each row while keeping array dimensions same using numpy?

I've the following array:
np.array([[0.07704314, 0.46752589, 0.39533099, 0.35752864],
[0.45813299, 0.02914078, 0.65307364, 0.58732429],
[0.32757561, 0.32946822, 0.59821108, 0.45585825],
[0.49054429, 0.68553148, 0.26657932, 0.38495586]])
I want to find the minimum value in each row of the array. How can I achieve this?
Expected answer:
[[0.07704314 0. 0. 0. ]
[0. 0.02914078 0. 0. ]
[0.32757561 0. 0. 0. ]
[0. 0. 0.26657932 0. ]]

You can use np.where like so:
np.where(a.argmin(1)[:,None]==np.arange(a.shape[1]), a, 0)
Or (more lines but potentially more efficient):
out = np.zeros_like(a)
idx = a.argmin(1)[:, None]
np.put_along_axis(out, idx, np.take_along_axis(a, idx, 1), 1)

IIUC: first find the minimum value of each row, then build a boolean mask that marks every element equal to its row's minimum, and finally multiply the original array by that mask to get the result.
np.multiply(a,a==np.min(a,1)[:,None])
Out[225]:
array([[0.07704314, 0. , 0. , 0. ],
[0. , 0.02914078, 0. , 0. ],
[0.32757561, 0. , 0. , 0. ],
[0. , 0. , 0.26657932, 0. ]])
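
The argmin-based and the equality-mask approaches differ when a row contains its minimum more than once. A small sketch with a made-up array that has a tie in its first row illustrates this:

```python
import numpy as np

# Made-up data: the first row contains its minimum (1.0) twice.
a = np.array([[1.0, 1.0, 2.0],
              [3.0, 0.5, 4.0]])

# argmin picks one position per row, so only the first minimum survives.
first_only = np.where(a.argmin(1)[:, None] == np.arange(a.shape[1]), a, 0)

# The equality mask keeps every element equal to the row minimum.
all_ties = a * (a == a.min(axis=1, keepdims=True))

print(first_only)
print(all_ties)
```

Which behavior is preferable depends on whether tied minima should all be kept or only one per row.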

np.amin(a, axis=1), where a is your NumPy array, returns the row minima as a 1-D array.

create 3D binary image

I have a 2D array, a, comprising a set of 100 x,y,z coordinates:
[[ 0.81 0.23 0.52]
[ 0.63 0.45 0.13]
...
[ 0.51 0.41 0.65]]
I would like to create a 3D binary image, b, with 101 pixels in each of the x,y,z dimensions, of coordinates ranging between 0.00 and 1.00.
Pixels at locations defined by a should take on a value of 1, all other pixels should have a value of 0.
I can create an array of zeros of the right shape with b = np.zeros((101,101,101)), but how do I assign coordinate and slice into it to create the ones using a?

First, start off by safely rounding your floats to ints. In context, see this question.
a_indices = np.rint(a * 100).astype(int)
Next, assign those indices in b to 1. Be careful to pass a tuple of per-axis index arrays rather than the 2D array itself, or else NumPy will treat it as a single index array for the first axis. (Recent NumPy versions no longer treat a list of arrays as a tuple here, so use tuple rather than list.) It seems as though performance of this method is comparable to that of alternatives (thanks @Divakar! :-)
b[tuple(a_indices.T)] = 1
I created a small example with size 10 instead of 100, and 2 dimensions instead of 3, to illustrate:
>>> a = np.array([[0.8, 0.2], [0.6, 0.4], [0.5, 0.6]])
>>> a_indices = np.rint(a * 10).astype(int)
>>> b = np.zeros((10, 10))
>>> b[tuple(a_indices.T)] = 1
>>> print(b)
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
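
The same idea in the 3D setting of the question might look as follows. This is only a sketch: it uses the three sample coordinates shown in the question rather than all 100, and the uint8 dtype is an assumption chosen to keep the binary image small.

```python
import numpy as np

# The three sample coordinates shown in the question (x, y, z in [0, 1]).
a = np.array([[0.81, 0.23, 0.52],
              [0.63, 0.45, 0.13],
              [0.51, 0.41, 0.65]])

# Map 0.00..1.00 onto integer pixel indices 0..100.
idx = np.rint(a * 100).astype(int)

# uint8 is an assumption; the question used the default float dtype.
b = np.zeros((101, 101, 101), dtype=np.uint8)

# A tuple of three index arrays addresses one (x, y, z) pixel per row of idx.
b[tuple(idx.T)] = 1

print(b.sum(), b[81, 23, 52])
```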

You could do something like this -
# Get the XYZ indices
idx = np.round(100 * a).astype(int)
# Initialize o/p array
b = np.zeros((101,101,101))
# Assign into o/p array based on linear index equivalents from indices array
np.put(b,np.ravel_multi_index(idx.T,b.shape),1)
Runtime on the assignment part -
Let's use a bigger grid for timing purposes.
In [82]: # Setup input and get indices array
...: a = np.random.randint(0,401,(100000,3))/400.0
...: idx = np.round(400 * a).astype(int)
...:
In [83]: b = np.zeros((401,401,401))
In [84]: %timeit b[list(idx.T)] = 1 ##Praveen soln
The slowest run took 42.16 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 6.28 ms per loop
In [85]: b = np.zeros((401,401,401))
In [86]: %timeit np.put(b,np.ravel_multi_index(idx.T,b.shape),1) # From this post
The slowest run took 45.34 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 5.71 ms per loop
In [87]: b = np.zeros((401,401,401))
In [88]: %timeit b[idx[:,0],idx[:,1],idx[:,2]] = 1 #Subscripted indexing
The slowest run took 40.48 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 6.38 ms per loop

Python creating matrix using if condition on indices : incorrect result

I have the following code where I have been trying to create a tridiagonal matrix x using if-conditions.
#!/usr/bin/env python
# import useful modules
import numpy as np
N=5
x=np.identity(N)
#x=np.zeros((N,N))
print(x)
# Construct NxN matrix
for i in range(N):
    for j in range(N):
        if i-j==1:
            x[i][j]=1
        elif j-1==1:
            x[i][j]=-1
        else:
            x[i][j]=0
        print("i= ",i," j= ",j)
print(x)
I desire to get
[[ 0. -1. 0. 0. 0.]
[ 1. 0. -1. 0. 0.]
[ 0. 1. 0. -1. 0.]
[ 0. 0. 1. 0. -1.]
[ 0. 0. 0. 1. 0.]]
However, I obtain
[[ 0. 0. -1. 0. 0.]
[ 1. 0. -1. 0. 0.]
[ 0. 1. -1. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. -1. 1. 0.]]
What's going wrong?
Bonus question: Can I force the indices to run from 1 to 5 instead of 0 to 4 in this example, or does Python never allow that?

elif j-1==1: should be elif j-i==1:.
And no, lists/arrays etc. are always indexed from 0.
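
As a side note, the desired matrix can also be built without loops, which avoids this kind of index typo altogether. A minimal sketch using np.eye with diagonal offsets:

```python
import numpy as np

N = 5
# k=-1 selects the sub-diagonal (i - j == 1), k=1 the super-diagonal (j - i == 1).
x = np.eye(N, k=-1) - np.eye(N, k=1)
print(x)
```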

As for the bonus question, the first element of a sequence in Python always has index 0. However, if for some particular reason (for example, to prevent off-by-one errors) you wish to count the elements of a sequence from a value other than 0, you can use the built-in function enumerate() and set its optional start parameter to fit your needs:
>>> seq = ['a', 'b', 'c']
>>> for count, item in enumerate(seq, start=1):
... print(count, item)
...
1 a
2 b
3 c
