Efficiently permute array in row wise using Numpy - python

Given a 2D array, I would like to permute this array row-wise.
Currently, I use a for loop to permute the 2D array row by row, as below:
for i in range(npart):
    pr=np.random.permutation(range(m))
    # arr_rand3 is the same as arr, but with each row permuted
    arr_rand3[i,:]=arr[i,pr]
However, I wonder whether NumPy has a way to do this in a single line (without the for loop).
The full code is
import numpy as np
arr=np.array([[0,0,0,0,0],[0,4,1,1,1],[0,1,1,2,2],[0,3,2,2,2]])
npart=len(arr[:,0])
m=len(arr[0,:])
# Permuted version of arr
arr_rand3=np.zeros(shape=np.shape(arr),dtype=int)
# Nodal association matrix for C
X=np.zeros(shape=(m,m),dtype=np.double)
# Random nodal association matrix for C_rand3
X_rand3=np.zeros(shape=(m,m),dtype=np.double)
for i in range(npart):
    pr=np.random.permutation(range(m))
    # arr_rand3 is the same as arr, but with each row permuted
    arr_rand3[i,:]=arr[i,pr]

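In NumPy 1.19+ you can shuffle along the column axis in a single call. Note that this reorders whole columns, so every row receives the same permutation: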
import numpy as np
arr = np.array([[0, 0, 0, 0, 0], [0, 4, 1, 1, 1], [0, 1, 1, 2, 2], [0, 3, 2, 2, 2]])
rng = np.random.default_rng()
arr_rand3 = rng.permutation(arr, axis=1)
print(arr_rand3)
Output
[[0 0 0 0 0]
[4 0 1 1 1]
[1 0 1 2 2]
[3 0 2 2 2]]
According to the documentation, the method random.Generator.permutation accepts a new parameter, axis:
axis : int, optional
The axis which x is shuffled along. Default is 0.
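If each row needs its own independent permutation (as in the for loop in the question), Generator.permutation is not enough, because it moves whole columns, which is why every row in the output above shows the same reordering. A minimal sketch of two alternatives, assuming NumPy 1.20+ for Generator.permuted; the seed and the names idx and arr_rand3_alt are only for illustration:

```python
import numpy as np

arr = np.array([[0, 0, 0, 0, 0],
                [0, 4, 1, 1, 1],
                [0, 1, 1, 2, 2],
                [0, 3, 2, 2, 2]])
rng = np.random.default_rng(0)  # seed chosen only to make the run reproducible

# Generator.permuted (NumPy 1.20+) shuffles each row independently along axis=1.
arr_rand3 = rng.permuted(arr, axis=1)
print(arr_rand3)

# Alternative without permuted: argsort of random keys yields one
# independent permutation of column indices per row.
idx = rng.random(arr.shape).argsort(axis=1)
arr_rand3_alt = np.take_along_axis(arr, idx, axis=1)
print(arr_rand3_alt)
```

Both variants return a new array and leave arr unchanged; each row keeps the same values, only their order differs.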

Related

Efficiently convert Numpy 2D array of counts to zero-padded 2D array of indices?

I have a numpy 2D array of n rows (observations) X m columns (features), where each element is the count of times that feature was observed. I need to convert it to a zero-padded 2D array of feature_indices, where each feature_index is repeated a number of times corresponding to the 'count' in the original 2D array.
This seems like it should be a simple combo of np.where with np.repeat or just expansion using indexing, but I'm not seeing it. Here's a very slow, loopy solution (way too slow to use in practice):
# Loopy solution (way too slow!)
def convert_2Dcountsarray_to_zeropaddedindices(countsarray2D):
    rowsums = np.sum(countsarray2D,1)
    max_rowsum = np.max(rowsums)
    out = []
    for row_idx, row in enumerate(countsarray2D):
        out_row = [0]*int(max_rowsum - rowsums[row_idx]) #Padding zeros so all out_rows same length
        for ele_idx in range(len(row)):
            # int() is needed because the hstack below turns the counts into floats
            out_row.extend([ele_idx]*int(row[ele_idx]))
        out.append(out_row)
    return np.array(out)
# Working example
countsarray2D = np.array( [[1,2,0,1,3],
[0,0,0,0,3],
[0,1,1,0,0]] )
# Shift all features up by 1 (i.e. add a dummy feature 0 we will use for padding)
countsarray2D = np.hstack( (np.zeros((len(countsarray2D),1)), countsarray2D) )
print(convert_2Dcountsarray_to_zeropaddedindices(countsarray2D))
# Desired result:
array([[1 2 2 4 5 5 5]
[0 0 0 0 5 5 5]
[0 0 0 0 0 2 3]])
One solution would be to flatten the array and use np.repeat.
This solution requires first adding the number of zeros to use as padding for each row to countsarray2D. This can be done as follows:
counts = countsarray2D.sum(axis=1)
max_count = max(counts)
zeros_to_add = max_count - counts
countsarray2D = np.c_[zeros_to_add, countsarray2D]
The new countsarray2D is then:
array([[0, 1, 2, 0, 1, 3],
[4, 0, 0, 0, 0, 3],
[5, 0, 1, 1, 0, 0]])
Now, we can flatten the array and use np.repeat. An index array A is used as the input array while countsarray2D determines the number of times each index value should be repeated.
n_rows, n_cols = countsarray2D.shape
A = np.tile(np.arange(n_cols), (n_rows, 1))
np.repeat(A, countsarray2D.flatten()).reshape(n_rows, -1)
Final result:
array([[1, 2, 2, 4, 5, 5, 5],
[0, 0, 0, 0, 5, 5, 5],
[0, 0, 0, 0, 0, 2, 3]])
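The steps above can be gathered into one function. This is only a sketch: the name counts_to_padded_indices is hypothetical, and it assumes the input is the original counts array (without the dummy column from the question) holding non-negative integers.

```python
import numpy as np

def counts_to_padded_indices(counts2d):
    """Hypothetical helper: expand counts into left-zero-padded, 1-based feature indices."""
    counts2d = np.asarray(counts2d, dtype=int)
    row_totals = counts2d.sum(axis=1)
    # Number of padding zeros each row needs to reach the longest row.
    pad = row_totals.max() - row_totals
    # Column 0 now holds the padding counts, so feature j becomes index j + 1.
    padded = np.c_[pad, counts2d]
    n_rows, n_cols = padded.shape
    index_grid = np.tile(np.arange(n_cols), (n_rows, 1))
    # np.repeat flattens index_grid and repeats each index by its count.
    return np.repeat(index_grid, padded.ravel()).reshape(n_rows, -1)

counts = np.array([[1, 2, 0, 1, 3],
                   [0, 0, 0, 0, 3],
                   [0, 1, 1, 0, 0]])
result = counts_to_padded_indices(counts)
print(result)
```

The reshape works because, after padding, every row repeats exactly max(row_totals) indices in total.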

continuous to categorical 2D array

I want to convert a continuous 2D numpy array to categories based on thresholds. When I use the pandas cut function I first have to flatten to a 1D array and then use cut, but the output will not reshape back to 2D with the numpy reshape function.
Here is a simple example:
import numpy as np
import pandas as pd
a = np.random.rand(2,3)
print(a)
b = a.flatten()
print(b)
c = pd.cut(b,(0,0.5,1),labels=[0,1])
print(c)
d = np.reshape(c,(2,3))
print(d)
The output is
[[ 0.56887807 0.1368459 0.34892358]
[ 0.77157277 0.64827644 0.42259086]]
[ 0.56887807 0.1368459 0.34892358 0.77157277 0.64827644 0.42259086]
[1, 0, 0, 1, 1, 0]
Categories (2, int64): [0 < 1]
[1, 0, 0, 1, 1, 0]
Categories (2, int64): [0 < 1]
The d array remains 1D even after the reshape command. How can I reshape it back to 2D?
If you are not tied to pandas' Categorical features, you can simply use np.digitize to convert the 2D array directly into categorical (integer) values.
Applied to the simple example:
c = np.digitize(a, bins=(0.5, 1))
print(c)
# [[1 0 0]
# [1 1 0]]
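For a reproducible comparison, here is a small sketch with fixed values instead of np.random.rand. It also shows one way to stay with pd.cut, by reshaping the integer codes of the resulting Categorical. The array values and the name codes are made up for illustration. Note that the two approaches treat a value lying exactly on a bin edge differently (pd.cut intervals are closed on the right by default, np.digitize bins are closed on the left).

```python
import numpy as np
import pandas as pd

# Fixed values (rounded from the question's output) so the result is reproducible.
a = np.array([[0.57, 0.14, 0.35],
              [0.77, 0.65, 0.42]])

# np.digitize keeps the input shape: values below 0.5 map to 0, values in [0.5, 1) to 1.
c = np.digitize(a, bins=(0.5, 1))
print(c)

# Staying with pd.cut: take the integer codes of the Categorical and reshape those.
codes = pd.cut(a.ravel(), (0, 0.5, 1), labels=[0, 1]).codes.reshape(a.shape)
print(codes)
```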

numpy 3D array shape

I created a numpy array of shape (4,3,2); I expected the following code to print an array shaped 4 X 3 or 3 X 4:
import numpy as np
x = np.zeros((4,3,2), np.int32)
print(x[:][:][0])
However, I got
[[0 0]
[0 0]
[0 0]]
That looks like a 3 X 2. I am really confused about numpy arrays now. Shouldn't I get
[[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]]
instead? How do I map a 3D image onto a numpy 3D array?
For example, in MATLAB, the shape (m, n, k) means (row, col, slice) in a context of an (2D/3D) image.
Thanks a lot
x[:] slices all elements along the first dimension, so x[:] gives the same result as x and x[:][:][0] is thus equivalent to x[0].
To select an element on the last dimension, you can do:
x[..., 0]
#array([[0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0]], dtype=int32)
or:
x[:,:,0]
#array([[0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0]], dtype=int32)
You need to specify all slices at the same time in a tuple, like so:
x[:, :, 0]
If you do x[:][:][0], you are actually indexing the first dimension three times. The first two create a view of the entire array, and the third creates a view of index 0 of the first dimension.
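A quick way to see the difference is to compare shapes. This is just a sketch; the names chained and last_axis are chosen for illustration.

```python
import numpy as np

x = np.zeros((4, 3, 2), np.int32)

# Chained indexing: each [:] returns a view of the whole array, so this is x[0].
chained = x[:][:][0]
print(chained.shape)    # (3, 2)

# A single index tuple addresses all three axes at once.
last_axis = x[:, :, 0]
print(last_axis.shape)  # (4, 3)

# The ellipsis form selects the same slice.
print(np.array_equal(last_axis, x[..., 0]))  # True
```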

get indices of non-zero elements of 2D array

From "Getting indices of both zero and nonzero elements in array", I can get the indices of non-zero elements in a 1D array in numpy like this:
indices_nonzero = numpy.arange(len(array))[~bindices_zero]
Is there a way to extend it to a 2D array?
You can use numpy.nonzero.
The following code is self-explanatory:
import numpy as np
A = np.array([[1, 0, 1],
[0, 5, 1],
[3, 0, 0]])
nonzero = np.nonzero(A)
# Returns a tuple of (nonzero_row_index, nonzero_col_index)
# That is (array([0, 0, 1, 1, 2]), array([0, 2, 1, 2, 0]))
nonzero_row = nonzero[0]
nonzero_col = nonzero[1]
for row, col in zip(nonzero_row, nonzero_col):
    print("A[{}, {}] = {}".format(row, col, A[row, col]))
"""
A[0, 0] = 1
A[0, 2] = 1
A[1, 1] = 5
A[1, 2] = 1
A[2, 0] = 3
"""
You can even do this
A[nonzero] = -100
print(A)
"""
[[-100 0 -100]
[ 0 -100 -100]
[-100 0 0]]
"""
Other variations
np.where(array)
With a single argument, it is equivalent to np.nonzero(array).
However, np.nonzero is preferred because its name is clearer.
np.argwhere(array)
It's equivalent to np.transpose(np.nonzero(array))
print(np.argwhere(A))
"""
[[0 0]
[0 2]
[1 1]
[1 2]
[2 0]]
"""
A = np.array([[1, 0, 1],
[0, 5, 1],
[3, 0, 0]])
np.stack(np.nonzero(A), axis=-1)
array([[0, 0],
[0, 2],
[1, 1],
[1, 2],
[2, 0]])
np.nonzero returns a tuple of arrays, one for each dimension of a, containing the indices of the non-zero elements in that dimension.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.nonzero.html
np.stack joins this tuple array along a new axis. In our case, the innermost axis also known as the last axis (denoted by -1).
The axis parameter specifies the index of the new axis in the dimensions of the result. For example, if axis=0 it will be the first dimension and if axis=-1 it will be the last dimension.
New in version 1.10.0.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.stack.html

Sum over rows in scipy.sparse.csr_matrix

I have a big csr_matrix and I want to add over rows and obtain a new csr_matrix with the same number of columns but reduced number of rows. (Context: The matrix is a document-term matrix obtained from sklearn CountVectorizer and I want to be able to quickly combine documents according to codes associated with these documents)
For a minimal example, this is my matrix:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())
[[1 0 0 0 0]
[0 0 3 0 0]
[0 5 0 0 0]
[4 0 0 0 0]
[0 0 2 0 0]]
Now let's say I want a new matrix B in which rows (1, 4) and (2, 3, 5) are combined by summing them, which would look something like this:
[[5 0 0 0 0]
[0 5 5 0 0]]
And should be again in sparse format (because the real data I'm working with is large). I tried to sum over slices of the matrix and then stack it:
idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))
But this gives me the summed up values just for the non-zero columns in the slice, so I can't combine it with the other slices because the number of columns in the summed slices are different.
I feel like there must be an easy way to do this. But I couldn't find any discussion of this online or in the documentation. What am I missing?
Thank you for your help
Note that you can do this by carefully constructing another matrix. Here's how it would work for a dense matrix:
>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
>>>
The sparse version is only a little more complicated. The information about which rows should be summed together is encoded in row:
col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
result = S * A
# check that the result is another sparse matrix
print(type(result))
# check that the values are the ones we want
print(result.toarray())
Output:
<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
[0 5 5 0 0]]
You can handle more rows in your output by including higher values in row and extending the shape of S accordingly.
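To generalize this, S can be built from an array of group labels rather than typed out by hand. A minimal sketch, assuming a hypothetical array groups in which groups[i] is the output row that input row i is summed into:

```python
import numpy as np
from scipy.sparse import csr_matrix

# The example matrix from the question.
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# groups[i] = index of the output row that input row i contributes to
# (hypothetical name; in practice this would come from the document codes).
groups = np.array([0, 1, 1, 0, 1])
n_groups = groups.max() + 1

# One entry of 1 per input row: S[g, i] = 1 when row i belongs to group g.
S = csr_matrix((np.ones(len(groups), dtype=A.dtype),
                (groups, np.arange(len(groups)))),
               shape=(n_groups, A.shape[0]))

B = S @ A  # sparse product, so the result stays sparse
print(B.toarray())
```

With more groups, only groups changes; the shape of S follows from n_groups.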
The indexing should be:
idx1 = [0, 3] # rows 1 and 4
idx2 = [1, 2, 4] # rows 2,3 and 5
Then you need to keep A_sub1 and A_sub2 in sparse format and use axis=0:
A_sub1 = csr_matrix(A[idx1, :].sum(axis=0))
A_sub2 = csr_matrix(A[idx2, :].sum(axis=0))
B = vstack((A_sub1, A_sub2))
B.toarray()
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
Note: I think the A[idx, :].sum(axis=0) operations convert the sparse matrices to dense ones, so @Mr_E's answer is probably better.
Alternatively, it works when you use axis=0 and np.vstack (as opposed to scipy.sparse.vstack):
A_sub1 = A[idx1, :].sum(axis=0)
A_sub2 = A[idx2, :].sum(axis=0)
np.vstack((A_sub1, A_sub2))
Giving:
matrix([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
