continuous to categorical 2D array - python

I want to convert a continuous 2D numpy array to categories based on thresholds. To use the pandas cut function I first have to flatten the array to 1D, but the output will not reshape back to 2D with the numpy reshape function.
Here is a simple example:
import numpy as np
import pandas as pd
a = np.random.rand(2,3)
print(a)
b = a.flatten()
print(b)
c = pd.cut(b,(0,0.5,1),labels=[0,1])
print(c)
d = np.reshape(c,(2,3))
print(d)
The output is
[[ 0.56887807 0.1368459 0.34892358]
[ 0.77157277 0.64827644 0.42259086]]
[ 0.56887807 0.1368459 0.34892358 0.77157277 0.64827644 0.42259086]
[1, 0, 0, 1, 1, 0]
Categories (2, int64): [0 < 1]
[1, 0, 0, 1, 1, 0]
Categories (2, int64): [0 < 1]
The d array remains 1D even after the reshape command. How can I reshape it back to 2D?

If you are not tied to pandas' Categorical features, you can simply use np.digitize to convert the 2D array directly into categorical (integer) values:
Applied to the simple example:
c = np.digitize(a, bins=(0.5, 1))
print(c)
# [[1 0 0]
# [1 1 0]]
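If you do want to keep pandas' Categorical labels, note that pd.cut returns a 1-D pandas Categorical, which np.reshape leaves unchanged. A sketch of a workaround, reusing the question's variables: pull out the integer codes, which form a plain numpy array and reshape normally:
d = c.codes.reshape(2, 3)  # .codes is a 1-D int ndarray; -1 would mark values outside the bins
print(d)
# e.g. [[1 0 0]
#       [1 1 0]]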

Related

Get Numpy ndarray value from list of nd points

How can I obtain the values of an ndarray from a list containing the coordinates of n-D points, as efficiently as possible?
Here is an implementation for 3D:
1 arr = np.array([[[0, 1]]])
2 points = [[0, 0, 1], [0, 0, 0]]
3 values = []
4 for point in points:
5     x, y, z = point
6     values.append(arr[x, y, z])
7 # values -> [1, 0]
If this is not possible, is there a way to generalize lines 5-6 to nD?
I am sure there is a way to achieve this using fancy indexing. Here is a way to do it without the for-loop:
arr = np.array([[[0, 1]]])
points = np.array([[0, 0, 1], [0, 0, 0]])
x, y, z = np.split(points, 3, axis=1)
arr[x, y, z]
output (values):
array([[1],
       [0]])
Alternatively, you could use tuple unpacking as suggested by the comment:
arr[(*points.T,)]
output:
array([1, 0])
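Another option worth sketching, in case the point list is very large: np.ravel_multi_index converts the coordinate columns into flat indices, so a single lookup on the flattened array returns all values at once:
arr = np.array([[[0, 1]]])
points = np.array([[0, 0, 1], [0, 0, 0]])
flat = np.ravel_multi_index(points.T, arr.shape)  # one flat index per point
arr.ravel()[flat]
# array([1, 0])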
Based on the Numpy documentation for indexing, you can easily do that, as long as you use tuples instead of lists:
arr = np.array([[[0, 1]]])
points = [(0, 0, 1), (0, 0, 0)]
values = []
for point in points:
    values.append(arr[point])
# values -> [1, 0]
This works independently of the dimensionality of the Numpy array involved.
Bonus: In addition to appending to a list, you can also use the Python slice function to extract ranges directly:
arr = np.array([[[0, 1]]])
points = (0, 0, slice(2))
vals = arr[points]
# --> [0 1] (a Numpy array!)

Efficiently permute an array row-wise using Numpy

Given a 2D array, I would like to permute this array row-wise.
Currently, I use a for-loop to permute the 2D array row by row, as below:
for i in range(npart):
    pr = np.random.permutation(range(m))
    # arr_rand3 is the same as arr, but with each row permuted
    arr_rand3[i, :] = arr[i, pr]
But I wonder whether Numpy can perform this in a single line (without the for-loop).
The full code is
import numpy as np
arr = np.array([[0,0,0,0,0],[0,4,1,1,1],[0,1,1,2,2],[0,3,2,2,2]])
npart = len(arr[:, 0])
m = len(arr[0, :])
# Permuted version of arr
arr_rand3 = np.zeros(shape=np.shape(arr), dtype=int)
# Nodal association matrix for C
X = np.zeros(shape=(m, m), dtype=np.double)
# Random nodal association matrix for C_rand3
X_rand3 = np.zeros(shape=(m, m), dtype=np.double)
for i in range(npart):
    pr = np.random.permutation(range(m))
    # arr_rand3 is the same as arr, but with each row permuted
    arr_rand3[i, :] = arr[i, pr]
In Numpy 1.19+ you should be able to do:
import numpy as np
arr = np.array([[0, 0, 0, 0, 0], [0, 4, 1, 1, 1], [0, 1, 1, 2, 2], [0, 3, 2, 2, 2]])
rng = np.random.default_rng()
arr_rand3 = rng.permutation(arr, axis=1)
print(arr_rand3)
Output
[[0 0 0 0 0]
 [4 0 1 1 1]
 [1 0 1 2 2]
 [3 0 2 2 2]]
According to the documentation, the method random.Generator.permutation accepts a new axis parameter:
axis : int, optional
    The axis which x is shuffled along. Default is 0.
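One caveat: permutation with axis=1 applies the same column shuffle to every row (visible in the output above, where every row has columns 0 and 1 swapped). If you want each row permuted independently, as the original for-loop does, the Generator method permuted (NumPy 1.20+) should do that:
arr_rand3 = rng.permuted(arr, axis=1)  # shuffles within each row independently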

Adding values to non-zero elements in a Sparse Matrix

I have a sparse matrix in which I want to increment all the values of non-zero elements by one. However, I cannot figure it out. Is there a way to do it using standard packages in python? Any help will be appreciated.
I cannot comment on its performance, but you can do the following (Scipy 1.1.0):
>>> from scipy.sparse import csr_matrix
>>> a = csr_matrix([[0, 2, 0], [1, 0, 0]])
>>> print(a)
(0, 1) 2
(1, 0) 1
>>> a[a.nonzero()] = a[a.nonzero()] + 1
>>> print(a)
(0, 1) 3
(1, 0) 2
If your matrix has 2 dimensions, you can do the following:
sparse_matrix = [[element if element == 0 else element + 1 for element in row] for row in sparse_matrix]
It iterates over every element of your matrix and returns the element unchanged if it equals zero; otherwise it adds 1 to the element and returns it. (Note that this treats the matrix as a nested list and returns a plain list, not a scipy sparse object.)
More about conditionals in list comprehensions in the answer to this question.
You can use the package numpy which has efficient functions for dealing with n-dimensional arrays. What you need is:
array[array>0] += 1
where array is the numpy array of your matrix. Example here:
import numpy as np
my_matrix = [[2,0,0,0,7],[0,0,0,4,0]]
array = np.array(my_matrix)
print("Matrix before incrementing values: \n", array)
array[array>0] += 1
print("Matrix after incrementing values: \n", array)
Outputs:
Matrix before incrementing values:
[[2 0 0 0 7]
[0 0 0 4 0]]
Matrix after incrementing values:
[[3 0 0 0 8]
[0 0 0 5 0]]
Hope this helps!
If you have a scipy sparse matrix (scipy.sparse), you can do:
import scipy.sparse as sp
my_matrix = [[2,0,0,0,7],[0,0,0,4,0]]
my_matrix = sp.csc_matrix(my_matrix)
my_matrix.data += 1
my_matrix.todense()
Returns:
[[3, 0, 0, 0, 8], [0, 0, 0, 5, 0]]
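One caveat, assuming your matrix might contain explicitly stored zeros (they can appear after assignments or arithmetic): .data covers every stored entry, not every mathematical non-zero, so it may be safer to drop stored zeros first:
my_matrix.eliminate_zeros()  # remove explicitly stored zeros, if any
my_matrix.data += 1          # now increments true non-zeros only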

numpy/scipy build adjacency matrix from weighted edgelist

I'm reading a weighted edgelist / numpy array like:
0 1 1
0 2 1
1 2 1
1 0 1
2 1 4
where the columns are 'User1', 'User2', 'Weight'. I'd like to perform a DFS algorithm with scipy.sparse.csgraph.depth_first_tree, which requires an N x N matrix as input. How can I convert the previous list into a square matrix like:
0 1 1
1 0 1
0 4 0
within numpy or scipy?
Thanks for your help.
EDIT:
I've been working with a huge network (150 million nodes), so I'm looking for a memory-efficient way to do that.
You could use a memory-efficient scipy.sparse matrix:
import numpy as np
import scipy.sparse as sparse
arr = np.array([[0, 1, 1],
                [0, 2, 1],
                [1, 2, 1],
                [1, 0, 1],
                [2, 1, 4]])
shape = tuple(arr.max(axis=0)[:2] + 1)
coo = sparse.coo_matrix((arr[:, 2], (arr[:, 0], arr[:, 1])), shape=shape,
                        dtype=arr.dtype)
print(repr(coo))
# <3x3 sparse matrix of type '<type 'numpy.int64'>'
#   with 5 stored elements in COOrdinate format>
To convert the sparse matrix to a dense numpy array, you could use todense:
print(coo.todense())
# [[0 1 1]
# [1 0 1]
# [0 4 0]]
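Since the stated goal is scipy.sparse.csgraph.depth_first_tree, the COO matrix can be fed to it directly after conversion to CSR; a sketch, where the start node 0 is an arbitrary choice:
from scipy.sparse.csgraph import depth_first_tree
tree = depth_first_tree(coo.tocsr(), 0, directed=True)  # returns the DFS tree as a sparse matrix
print(tree.toarray())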
Try something like the following:
import numpy as np
import scipy.sparse as sps
A = np.array([[0, 1, 1], [0, 2, 1], [1, 2, 1], [1, 0, 1], [2, 1, 4]])
i, j, weight = A[:, 0], A[:, 1], A[:, 2]
# find the dimension of the square matrix (largest node index + 1)
dim = max(i.max(), j.max()) + 1
B = sps.lil_matrix((dim, dim))
for row, col, w in zip(i, j, weight):
    B[row, col] = w
print(B.todense())
[[ 0. 1. 1.]
 [ 1. 0. 1.]
 [ 0. 4. 0.]]

Sum over rows in scipy.sparse.csr_matrix

I have a big csr_matrix and I want to sum over rows to obtain a new csr_matrix with the same number of columns but a reduced number of rows. (Context: the matrix is a document-term matrix obtained from sklearn's CountVectorizer, and I want to be able to quickly combine documents according to codes associated with these documents.)
For a minimal example, this is my matrix:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())
[[1 0 0 0 0]
[0 0 3 0 0]
[0 5 0 0 0]
[4 0 0 0 0]
[0 0 2 0 0]]
Now let's say I want a new matrix B in which rows (1, 4) and (2, 3, 5) are combined by summing them, which would look something like this:
[[5 0 0 0 0]
[0 5 5 0 0]]
And should be again in sparse format (because the real data I'm working with is large). I tried to sum over slices of the matrix and then stack it:
idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))
But this gives me the summed up values just for the non-zero columns in the slice, so I can't combine it with the other slices because the number of columns in the summed slices are different.
I feel like there must be an easy way to do this. But I couldn't find any discussion of this online or in the documentation. What am I missing?
Thank you for your help
Note that you can do this by carefully constructing another matrix. Here's how it would work for a dense matrix:
>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
The sparse version is only a little more complicated. The information about which rows should be summed together is encoded in row:
col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
result = S * A
# check that the result is another sparse matrix
print(type(result))
# check that the values are the ones we want
print(result.toarray())
Output:
<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
[0 5 5 0 0]]
You can handle more rows in your output by including higher values in row and extending the shape of S accordingly.
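If every row of A carries a group code (like the document codes mentioned in the question), S can also be built directly from that label array rather than by hand; a sketch, assuming codes[k] gives the output row for input row k:
import numpy as np
from scipy.sparse import csr_matrix
codes = np.array([0, 1, 1, 0, 1])  # group label of each of A's rows
n = A.shape[0]
S = csr_matrix((np.ones(n, dtype=int), (codes, np.arange(n))),
               shape=(codes.max() + 1, n))
B = S * A  # rows of A summed within each group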
The indexing should be:
idx1 = [0, 3] # rows 1 and 4
idx2 = [1, 2, 4] # rows 2,3 and 5
Then you need to keep A_sub1 and A_sub2 in sparse format and use axis=0:
A_sub1 = csr_matrix(A[idx1, :].sum(axis=0))
A_sub2 = csr_matrix(A[idx2, :].sum(axis=0))
B = vstack((A_sub1, A_sub2))
B.toarray()
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
Note, I think the A[idx, :].sum(axis=0) operations involve conversion from sparse matrices, so @Mr_E's answer is probably better.
Alternatively, it works when you use axis=0 and np.vstack (as opposed to scipy.sparse.vstack):
A_sub1 = A[idx1, :].sum(axis=0)
A_sub2 = A[idx2, :].sum(axis=0)
np.vstack((A_sub1, A_sub2))
Giving:
matrix([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
