Efficient manipulations of binary matrix in numpy - python

By binary matrix, I mean every element in the matrix is either 0 or 1, and I use the Matrix class in numpy for this.
First of all, is there a specific type of matrix in numpy for it, or do we simply use a matrix that is populated with 0s and 1s?
Second, what is the quickest way for creating a square matrix full of 0s given its dimension with the Matrix class? Note: numpy.zeros((dim, dim)) is not what I want, as it creates a 2-D array with float 0.
Third, I want to get and set any given row of the matrix frequently. For get, I can think of using row = my_matrix.A[row_index].tolist(), which will return a list representation of the given row. For set, it seems that I can just do my_matrix[row_index] = row_list, with row_list being a list of the same length as the given row. Again, I wonder whether they are the most efficient methods for doing the jobs.

To make a NumPy array whose elements can only be 0 or 1, pass dtype=bool:
arr = np.zeros((dim, dim), dtype=bool)
To convert arr to a NumPy matrix:
arr = np.matrix(arr)
To access a row:
arr[row_num]
and to set a row:
arr[row_num] = new_row
Plain indexing like this is the quickest way to get and set a row.
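A minimal sketch of getting and setting rows on a boolean array. The size dim = 4 and the row values are made up for the demo, and it uses a plain ndarray rather than np.matrix (NumPy's documentation no longer recommends np.matrix for new code):

```python
import numpy as np

dim = 4  # hypothetical size, chosen only for this demo
arr = np.zeros((dim, dim), dtype=bool)  # every entry starts as False (0)

# Set row 1 from a plain Python list of 0s and 1s; NumPy casts them to bool.
arr[1] = [1, 0, 1, 1]

# Get row 1 back. Basic indexing returns a view, so no data is copied.
row = arr[1]

# Convert to a list of ints only if a Python list is really needed.
row_list = row.astype(int).tolist()
print(row_list)
```

Working with the row as an array (row above) avoids the cost of building a Python list on every access, which matters if rows are read frequently.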

Related

Numpy array apply a function only to some elements

I have a NumPy array with shape (10, 10), for example.
Now I want to apply np.exp() to this array, but only to the elements that satisfy a condition. For example, I want to apply np.exp to all elements that are not 0 or 1. Is there a way to do that without a for loop that iterates over each element of the array?
This is achievable with basic NumPy operations. Here is one way to do it:
A = np.random.randint(0,5,size=(10,10)).astype(float) # data
goods = (A!=0) & (A!=1) # 10 x 10 boolean array
A[goods] = np.exp(A[goods]) # boolean indexing
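An alternative sketch, not taken from the answer above: NumPy ufuncs such as np.exp accept out and where arguments, which apply the function in place only where the mask is True. The seed and the variable original are assumptions added for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
A = rng.integers(0, 5, size=(10, 10)).astype(float)
original = A.copy()  # kept only so the result can be checked afterwards

goods = (A != 0) & (A != 1)  # True where exp should be applied

# Write exp(A) into A itself, but only at positions where goods is True.
# Positions where goods is False keep their existing values.
np.exp(A, out=A, where=goods)
```

Both forms give the same result; the where form avoids building the temporary array A[goods].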

Cleanest way to replace np.array value with np.nan by user defined index

A question about masking 2-D np.array data.
For example:
A 2-D np.array value with shape 20 x 20.
An index t = [(1,2),(3,4),(5,7),(12,13)]
How do I mask the 2-D array value at the (y,x) positions listed in the index?
Usually, replacing with np.nan is based on a specific value, like y[y==7] = np.nan
In my example, I want to replace the values at specific locations with np.nan.
For now, I can do it by:
Creating a new array value_mask with shape 20 x 20
Looping over value and testing each location with (i,j) == t[k]
If True, value_mask[i,j] = value[i,j]; otherwise, value_mask[i,j] = np.nan
My method is too bulky, especially for huge data (3 levels of loops).
Is there a more efficient method to achieve that? Any advice would be appreciated.
You are nearly there.
You can pass arrays of indices to arrays. You probably know this from 1-D arrays.
With a 2-D array you need to pass a tuple of sequences (one sequence per axis; the sequences must have equal length, with one entry for each array element you want to choose). You have a list of tuples, so you just have to "transpose" it.
t1 = tuple(zip(*t))
gives you the right shape for your index; you can now use it as the index for any assignment, for example: value[t1] = np.nan
(In Python 3, zip returns an iterator, so it has to be wrapped in tuple() before it can be used as an index. There are lots of nice explanations of this trick with zip and * in Python tutorials, if you don't know it yet.)
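A small sketch of the transpose trick on made-up data. The contents of value are an assumption for the demo; t is the index list from the question:

```python
import numpy as np

value = np.arange(400, dtype=float).reshape(20, 20)  # made-up 20 x 20 data
t = [(1, 2), (3, 4), (5, 7), (12, 13)]               # (y, x) pairs from the question

# "Transpose" the list of pairs into one sequence per axis.
rows, cols = zip(*t)  # rows == (1, 3, 5, 12), cols == (2, 4, 7, 13)

# A single indexed assignment replaces all four locations, with no Python loop.
value[list(rows), list(cols)] = np.nan
```

The array needs a float dtype for this to work, since np.nan cannot be stored in an integer array.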
You can use np.logical_and
arr = np.zeros((20,20))
You can select by location, this is just an example location.
arr[4:8,4:8] = 1
You can create a mask the same shape as arr
mask = np.ones((20,20)).astype(bool)
Then you can use the np.logical_and.
mask = np.logical_and(mask, arr == 1)
And finally, you can replace the 1s with the np.nan
arr[mask] = np.nan

scipy sparse matrix: remove the rows whose all elements are zero

I have a sparse matrix produced by sklearn's TfidfVectorizer. I believe that some rows are all-zero rows, and I want to remove them. However, as far as I know, the existing built-in functions, e.g. nonzero() and eliminate_zeros(), deal with zero entries rather than rows.
Is there any easy way to remove all-zero rows from a sparse matrix?
Example:
What I have now (actually in sparse format):
[ [0, 0, 0]
[1, 0, 2]
[0, 0, 1] ]
What I want to get:
[ [1, 0, 2]
[0, 0, 1] ]
Slicing + getnnz() does the trick:
M = M[M.getnnz(1)>0]
Works directly on csr_array.
You can also remove all 0 columns without changing formats:
M = M[:,M.getnnz(0)>0]
However if you want to remove both you need
M = M[M.getnnz(1)>0][:,M.getnnz(0)>0] #GOOD
I am not sure why but
M = M[M.getnnz(1)>0, M.getnnz(0)>0] #BAD
does not work.
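A short sketch of the getnnz approach on the 3 x 3 example from the question. The names row_mask and col_mask are introduced here for illustration only:

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[0, 0, 0],
                                [1, 0, 2],
                                [0, 0, 1]]))

row_mask = M.getnnz(axis=1) > 0  # [False, True, True]
col_mask = M.getnnz(axis=0) > 0  # [True, False, True]

no_zero_rows = M[row_mask]                       # drop all-zero rows
no_zero_rows_or_cols = M[row_mask][:, col_mask]  # drop rows, then columns
```

Computing both masks from the original M before slicing keeps the two steps independent of each other.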
There aren't existing functions for this, but it's not too bad to write your own:
def remove_zero_rows(M):
    M = scipy.sparse.csr_matrix(M)
First, convert the matrix to CSR (compressed sparse row) format. This is important because CSR matrices store their data as a triple of (data, indices, indptr), where data holds the nonzero values, indices stores column indices, and indptr holds row index information. The docs explain better:
the column indices for row i are stored in
indices[indptr[i]:indptr[i+1]] and their corresponding values are
stored in data[indptr[i]:indptr[i+1]].
So, to find rows without any nonzero values, we can just look at successive values of M.indptr. Continuing our function from above:
    num_nonzeros = np.diff(M.indptr)
    return M[num_nonzeros != 0]
The second benefit of CSR format here is that it's relatively cheap to slice rows, which simplifies the creation of the resulting matrix.
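Put together, the function described above might look like the following sketch. The docstring and the small dense test matrix are additions for the demo:

```python
import numpy as np
from scipy import sparse

def remove_zero_rows(M):
    """Return M in CSR format without the rows that store no entries."""
    M = sparse.csr_matrix(M)
    # indptr[i+1] - indptr[i] is the number of stored entries in row i.
    num_nonzeros = np.diff(M.indptr)
    return M[num_nonzeros != 0]

dense = np.array([[0, 0, 0],
                  [1, 0, 2],
                  [0, 0, 1]])
result = remove_zero_rows(dense)
print(result.toarray())
```

One caveat: indptr counts stored entries, so a row that holds only explicitly stored zeros would be kept. Calling eliminate_zeros() on the matrix beforehand should cover that case.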
Thanks for your reply, @perimosocordiae
I just found another solution myself. I am posting it here in case someone needs it in the future.
def remove_zero_rows(X):
    # X is a scipy sparse matrix. We want to remove all zero rows from it
    nonzero_row_indice, _ = X.nonzero()
    unique_nonzero_indice = numpy.unique(nonzero_row_indice)
    return X[unique_nonzero_indice]

Copying row element in a numpy array

I have an array X of <class 'scipy.sparse.csr.csr_matrix'> format with shape (44, 4095)
I would now like to create a new NumPy array, say X_train = np.empty([44, 4095]), and copy rows into it in a different order. Say I want the 5th row of X in the 1st row of X_train.
How do I do this (copying an entire row into a new NumPy array), similar to MATLAB?
Define the new row order as a list of indices, then define X_train using integer indexing:
row_order = [4, ...]
X_train = X[row_order]
Note that unlike Matlab, Python uses 0-based indexing, so the 5th row has index 4.
Also note that integer indexing (due to its ability to select values in arbitrary order) returns a copy of the original NumPy array.
This works equally well for sparse matrices and NumPy arrays.
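A small sketch of the reordering on a stand-in matrix. The 4 x 3 data and the particular row_order are hypothetical, chosen only to keep the demo readable:

```python
import numpy as np
from scipy import sparse

# Small stand-in for the (44, 4095) sparse matrix from the question.
X = sparse.csr_matrix(np.arange(12).reshape(4, 3))
row_order = [3, 0, 2, 1]  # hypothetical order: output row k is X's row row_order[k]

X_reordered = X[row_order]       # still sparse, rows copied in the new order
X_train = X_reordered.toarray()  # dense NumPy array, if one is really needed
```

Indexing a sparse matrix returns another sparse matrix, so the toarray() call is what produces an actual NumPy array.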
Python works generally by reference, which is something you should keep in mind. What you need to do is make a copy and then swap. I have written a demo function which swaps rows.
import numpy as np

def swapRows(myArray, rowA, rowB):
    """Swap rowA with rowB in place."""
    temp = myArray[rowA, :].copy()  # copy rowA, because slicing returns a view
    myArray[rowA, :] = myArray[rowB, :].copy()
    myArray[rowB, :] = temp

a = np.arange(30)     # generate demo data
a = a.reshape(6, 5)   # reshape the data into a 6x5 matrix
print(a)              # print the matrix before the swap
swapRows(a, 0, 1)     # swap the rows
print(a)              # print the matrix after the swap
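As a side note, the same swap can be sketched without a helper function, using integer-array indexing on both sides of the assignment. This is an alternative to the function above, not part of it:

```python
import numpy as np

a = np.arange(30).reshape(6, 5)

# The right-hand side is evaluated first and is a copy (integer-array indexing
# never returns a view), so rows 0 and 1 are swapped safely in one statement.
a[[0, 1]] = a[[1, 0]]
print(a[:2])
```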
To answer your question, one solution would be to use
X_train = np.empty([44, 4095])
X_train[0, :] = X[4, :].toarray().ravel() # store the 5th row of X in the 1st row; X is sparse, so the row is converted to a dense 1-D array first
unutbu's answer seems to be the most logical.
Kind Regards,

Select specific rows from a 2d Numpy array using a sparse binary 1-d array

I am having issues figuring out how to do this operation.
I have a variable index, which is a 1xN sparse binary array, and a 2-D array samples (NxM). I want to use index to select specific rows of samples and get a 2-D array.
I have tried stuff like:
idx = index.todense() == 1
samples[idx.T,:]
but nothing.
So far I have made it work doing this:
idx = test_x.todense() == 1
selected_samples = samples[np.array(idx.flat)]
But there should be a cleaner way.
To give an idea using a fraction of the data:
print(idx.shape) # (1, 22360)
print(samples.shape) # (22360, 200)
The short answer:
selected_samples = samples[index.nonzero()[1]]
The long answer:
The first problem is that your index matrix is 1xN while your sample ndarray is NxM. (See the mismatch?) This is why you needed to call .flat.
That's not a big deal, though, because we just need the nonzero entries in the sparse vector. Get those with index.nonzero(), which returns a tuple of (row indices, column indices). We only care about the column indices, so we use index.nonzero()[1] to get those by themselves.
Then, simply index with the array of nonzero column indices and you're done.
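A minimal sketch of the short answer on made-up data. The shapes (N = 6, M = 2) and the positions of the 1s are assumptions for the demo:

```python
import numpy as np
from scipy import sparse

samples = np.arange(12).reshape(6, 2)                      # made-up (N, M) = (6, 2) data
index = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1, 1]]))  # 1 x N sparse binary row vector

# nonzero() returns (row indices, column indices); the column indices of the
# 1 x N vector are exactly the row numbers to pick from samples.
selected_rows = index.nonzero()[1]
selected_samples = samples[selected_rows]
print(selected_samples)
```

This never converts index to a dense matrix, which is the main saving over the todense() version in the question.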
