SciPy: Symmetric permutation of sparse CSR matrix - python

I'd like to symmetrically permute a sparse matrix, permuting rows and columns in the same way. For example, I would like to rotate the rows and columns, which takes:
1 2 3
0 1 0
0 0 1
to
1 0 0
0 1 0
2 3 1
In Octave or MATLAB, one can do this concisely with matrix indexing:
A = sparse([1 2 3; 0 1 0; 0 0 1]);
perm = [2 3 1];
Aperm = A(perm,perm);
I am interested in doing this in Python, with NumPy/SciPy. Here is an attempt:
#!/usr/bin/env python
import numpy as np
from scipy.sparse import csr_matrix
row = np.array([0, 0, 0, 1, 2])
col = np.array([0, 1, 2, 1, 2])
data = np.array([1, 2, 3, 1, 1])
A = csr_matrix((data, (row, col)), shape=(3, 3))
p = np.array([1, 2, 0])
#Aperm = A[p,p] # gives [1,1,1], the permuted diagonal
Aperm = A[:,p][p,:] # works, but more verbose
Is there a cleaner way to accomplish this sort of symmetric permutation of a matrix?
(I'm more interested in concise syntax than I am in performance)

In MATLAB
A(perm,perm)
is a block operation. In numpy A[perm,perm] selects elements on the diagonal.
A[perm[:,None], perm]
is the block indexing. The MATLAB equivalent of the diagonal selection requires something like sub2ind. What's concise in one is more verbose in the other, and vice versa.
Actually numpy uses the same logic in both cases. It 'broadcasts' one index against the other: (n,) against (n,) in the diagonal case, and (n,1) against (1,n) in the block case. The results are (n,) and (n,n) shaped, respectively.
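A quick dense demo of the two index forms, a small sketch using the question's matrix:
import numpy as np
A = np.array([[1, 2, 3],
              [0, 1, 0],
              [0, 0, 1]])
p = np.array([1, 2, 0])
A[p, p]            # broadcasts (n,) with (n,): the permuted diagonal, array([1, 1, 1])
A[p[:, None], p]   # broadcasts (n,1) with (1,n): the symmetric block permutation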
This numpy indexing works with sparse matrices as well, though it isn't as fast. Sparse fancy indexing is actually implemented with matrix multiplication, using 'extractor' matrices built from the indices (two of them here, roughly M*A*M.T).
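That extractor idea can also be written out explicitly: a sparse permutation matrix P with a 1 at (k, p[k]) gives the same symmetric permutation as A[p,:][:,p]. A minimal sketch, not from the question:
import numpy as np
from scipy import sparse

row = np.array([0, 0, 0, 1, 2])
col = np.array([0, 1, 2, 1, 2])
data = np.array([1, 2, 3, 1, 1])
A = sparse.csr_matrix((data, (row, col)), shape=(3, 3))
p = np.array([1, 2, 0])

# P[k, p[k]] = 1, so (P @ A @ P.T)[k, l] == A[p[k], p[l]]
P = sparse.csr_matrix((np.ones(len(p)), (np.arange(len(p)), p)), shape=(3, 3))
Aperm = P @ A @ P.T
For the sparse matrix classes, P * A * P.T does the same thing, since * is matrix multiplication there.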
MATLAB's documentation about a permutation matrix:
https://www.mathworks.com/help/matlab/math/sparse-matrix-operations.html#f6-13070

Related

What is the most efficient way to convert from a list of values to a scipy sparse matrix?

I have a list of values that I'm using a loop to convert to a scipy.sparse.dok_matrix. I'm aware of numpy.bincount but it doesn't work with sparse matrices. I'm wondering if there is a more efficient way to perform this conversion because the construction time for a dok_matrix is really long.
Example below for one row but I'm scaling to a 2D matrix by looping. The number of times a value x appears in the input list is the value of the xth element of the result matrix.
values = [1, 3, 3, 4]
expected_result = [0, 1, 0, 2, 1]
matrix = dok_matrix((1, MAXIMUM_EXPECTED_VALUE))
for value in values:
    matrix[0, value] = matrix.get((0, value)) + 1
MAXIMUM_EXPECTED_VALUE is in the order of 100000000 but len(values) < 100, which is why I'm using a sparse matrix. Possibly off-topic: there are also only a little over 10000 actual values that are used in the range of MAXIMUM_EXPECTED_VALUE but I think hashing to a contiguous range and converting back might be more complicated.
Looks like the standard coo-style input suits your case:
In [143]: from scipy import sparse
In [144]: values = [1,3,3,4]
In [145]: col = np.array(values)
In [146]: row = np.zeros_like(col)
In [147]: data = np.ones_like(col)
In [148]: M = sparse.coo_matrix((data, (row,col)), shape=(1,10))
In [149]: M
Out[149]:
<1x10 sparse matrix of type '<class 'numpy.int64'>'
with 4 stored elements in COOrdinate format>
In [150]: M.A
Out[150]: array([[0, 1, 0, 2, 1, 0, 0, 0, 0, 0]])
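Duplicate (row, col) pairs are summed when the coo matrix is converted or densified, which is what produces the counts. The same construction extends to the 2D case described in the question; a sketch, assuming one list of values per sample (the variable names are mine):
import numpy as np
from scipy import sparse

rows_of_values = [[1, 3, 3, 4], [0, 2, 2, 2]]    # hypothetical input: one list per sample
row = np.concatenate([np.full(len(v), i) for i, v in enumerate(rows_of_values)])
col = np.concatenate([np.asarray(v) for v in rows_of_values])
data = np.ones_like(col)
M = sparse.coo_matrix((data, (row, col)), shape=(len(rows_of_values), 10)).tocsr()
M.toarray()
# array([[0, 1, 0, 2, 1, 0, 0, 0, 0, 0],
#        [1, 0, 3, 0, 0, 0, 0, 0, 0, 0]])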

Convert numpy array with values into array with frequency for each observation in each row

I have a numpy array as follows:
array = np.random.randint(6, size=(50, 400))
This array holds the cluster that each value belongs to, with each row representing a sample and each column representing a feature. I would like to create an array with five columns giving the frequency of each cluster in each sample (each sample being a row of this matrix).
However, in the frequency calculation I want to ignore 0, so the frequencies of the remaining values (1-5) should add to 1 within each row.
Essentially, I want an array with one column per cluster (1-5) and one row per sample.
How can this be done?
Edit:
small input:
input = np.random.randint(6, size=(2, 5))
array([[0, 4, 2, 3, 0],
       [5, 5, 2, 5, 3]])
output:
  1     2     3     4     5
  0   .33   .33   .33     0
  0    .2    .2     0    .6
Where 1-5 are the column labels, and the bottom two rows are the desired output as a numpy array.
This is a simple application of bincount. Does this do what you want?
def freqs(x):
    counts = np.bincount(x, minlength=6)[1:]
    return counts / counts.sum()

frequencies = np.apply_along_axis(freqs, axis=1, arr=array)
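Applied to the small 2x5 example from the question (reusing the freqs defined just above), this reproduces the desired output; a quick check rather than part of the answer:
small = np.array([[0, 4, 2, 3, 0],
                  [5, 5, 2, 5, 3]])
np.apply_along_axis(freqs, axis=1, arr=small)
# array([[0.  , 0.33, 0.33, 0.33, 0.  ],
#        [0.  , 0.2 , 0.2 , 0.  , 0.6 ]])   (values shown rounded)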
If you were wondering about the speed implications of apply_along_axis, this method using tricky indexing is marginally slower in my tests:
values = np.arange(1, 6)   # the cluster labels 1-5
counts = (array[:, :, None] == values[None, None, :]).sum(axis=1)
frequencies2 = counts / counts.sum(axis=1)[:, None]

Convert scipy condensed distance matrix to lower matrix read by rows

I have a condensed distance matrix from scipy that I need to pass to a C function that requires the matrix be converted to the lower triangle read by rows. For example:
0 1 2 3
  0 4 5
    0 6
      0
The condensed form of this is: [1,2,3,4,5,6] but I need to convert it to
0
1 0
2 4 0
3 5 6 0
The lower triangle read by rows is: [1,2,4,3,5,6].
I was hoping to convert the compact distance matrix to this form without creating a redundant matrix.
Here's a quick implementation--but it creates the square redundant distance matrix as an intermediate step:
In [128]: import numpy as np
In [129]: from scipy.spatial.distance import squareform
c is the condensed form of the distance matrix:
In [130]: c = np.array([1, 2, 3, 4, 5, 6])
d is the redundant square distance matrix:
In [131]: d = squareform(c)
Here's your condensed lower triangle distances:
In [132]: d[np.tril_indices(d.shape[0], -1)]
Out[132]: array([1, 2, 4, 3, 5, 6])
Here's a method that avoids forming the redundant distance matrix. The function condensed_index(i, j, n) takes the row i and column j of the redundant distance matrix, with j > i, and returns the corresponding index in the condensed distance array.
In [169]: def condensed_index(i, j, n):
     ...:     return n*i - i*(i+1)//2 + j - i - 1
     ...:
As above, c is the condensed distance array.
In [170]: c
Out[170]: array([1, 2, 3, 4, 5, 6])
In [171]: n = 4
In [172]: i, j = np.tril_indices(n, -1)
Note that the arguments are reversed in the following call:
In [173]: indices = condensed_index(j, i, n)
indices gives the desired permutation of the condensed distance array.
In [174]: c[indices]
Out[174]: array([1, 2, 4, 3, 5, 6])
(Basically the same function as condensed_index(i, j, n) was given in several answers to this question.)
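For convenience, the pieces above can be packaged into a single helper; a small self-contained sketch (the wrapper name is mine):
import numpy as np

def condensed_index(i, j, n):
    # Index into the condensed distance array for entry (i, j) of the
    # n x n square distance matrix, assuming j > i.
    return n*i - i*(i+1)//2 + j - i - 1

def condensed_to_lower_by_rows(c, n):
    # Reorder the condensed distances into the lower triangle read by
    # rows, without building the square matrix.
    i, j = np.tril_indices(n, -1)
    return c[condensed_index(j, i, n)]

condensed_to_lower_by_rows(np.array([1, 2, 3, 4, 5, 6]), 4)
# array([1, 2, 4, 3, 5, 6])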

Operations on 'N' dimensional numpy arrays

I am attempting to generalize some Python code to operate on arrays of arbitrary dimension. The operations are applied to each vector in the array. So for a 1D array, there is simply one operation, for a 2-D array it would be both row and column-wise (linearly, so order does not matter). For example, a 1D array (a) is simple:
b = operation(a)
where 'operation' is expecting a 1D array. For a 2D array, the operation might proceed as
for ii in range(a.shape[0]):
    b[ii, :] = operation(a[ii, :])
for jj in range(b.shape[1]):
    c[:, jj] = operation(b[:, jj])
I would like to make this general where I do not need to know the dimension of the array beforehand, and not have a large set of if/elif statements for each possible dimension.
Solutions that are general for 1 or 2 dimensions are ok, though a completely general solution would be preferred. In reality, I do not imagine needing this for any dimension higher than 2, but if I can see a general example I will learn something!
Extra information:
I have MATLAB code that uses cell arrays to do something similar, but I do not fully understand how it works. In this example, each vector is rearranged (basically the same function as fftshift in numpy.fft). Not sure if this helps, but it operates on an array of arbitrary dimension.
function aout = foldfft(ain)
nd = ndims(ain);
for k = 1:nd
    nx = size(ain,k);
    kx = floor(nx/2);
    idx{k} = [kx:nx 1:kx-1];
end
aout = ain(idx{:});
In Octave, your MATLAB code does:
octave:19> size(ain)
ans =
2 3 4
octave:20> idx
idx =
{
[1,1] =
1 2
[1,2] =
1 2 3
[1,3] =
2 3 4 1
}
and then it uses the idx cell array to index ain. With these dimensions it 'rolls' the size 4 dimension.
For 5 and 6 the index lists would be:
2 3 4 5 1
3 4 5 6 1 2
The equivalent in numpy is:
In [161]: ain=np.arange(2*3*4).reshape(2,3,4)
In [162]: idx=np.ix_([0,1],[0,1,2],[1,2,3,0])
In [163]: idx
Out[163]:
(array([[[0]],
[[1]]]), array([[[0],
[1],
[2]]]), array([[[1, 2, 3, 0]]]))
In [164]: ain[idx]
Out[164]:
array([[[ 1, 2, 3, 0],
[ 5, 6, 7, 4],
[ 9, 10, 11, 8]],
[[13, 14, 15, 12],
[17, 18, 19, 16],
[21, 22, 23, 20]]])
Besides the 0 based indexing, I used np.ix_ to reshape the indexes. MATLAB and numpy use different syntax to index blocks of values.
The next step is to construct [0,1],[0,1,2],[1,2,3,0] with code, a straight forward translation.
I can use np.r_ as a short cut for turning 2 slices into an index array:
In [201]: idx = []
In [202]: for nx in ain.shape:
     ...:     kx = int(np.floor(nx/2.))
     ...:     kx = kx - 1
     ...:     idx.append(np.r_[kx:nx, 0:kx])
     ...:
In [203]: idx
Out[203]: [array([0, 1]), array([0, 1, 2]), array([1, 2, 3, 0])]
and pass this through np.ix_ to make the appropriate index tuple:
In [204]: ain[np.ix_(*idx)]
Out[204]:
array([[[ 1, 2, 3, 0],
[ 5, 6, 7, 4],
[ 9, 10, 11, 8]],
[[13, 14, 15, 12],
[17, 18, 19, 16],
[21, 22, 23, 20]]])
In this case, where 2 dimensions don't roll anything, slice(None) could replace those:
In [210]: idx=(slice(None),slice(None),[1,2,3,0])
In [211]: ain[idx]
======================
np.roll does:
indexes = concatenate((arange(n - shift, n), arange(n - shift)))
res = a.take(indexes, axis)
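A quick check of that take-based indexing on a 1-D example (a small sketch, not from the answer):
import numpy as np

a = np.arange(6)
n, shift = a.shape[0], 2
indexes = np.concatenate((np.arange(n - shift, n), np.arange(n - shift)))
a.take(indexes)    # array([4, 5, 0, 1, 2, 3])
np.roll(a, 2)      # same result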
np.apply_along_axis is another function that constructs an index array (and turns it into a tuple for indexing).
If you are looking for a programmatic way to index the k-th dimension an n-dimensional array, then numpy.take might help you.
An implementation of foldfft is given below as an example:
In[1]:
import numpy as np

def foldfft(ain):
    result = ain
    nd = len(ain.shape)
    for k in range(nd):
        nx = ain.shape[k]
        kx = (nx + 1) // 2
        shifted_index = list(range(kx, nx)) + list(range(kx))
        result = np.take(result, shifted_index, k)
    return result

a = np.indices([3, 3])
print("Shape of a = ", a.shape)
print("\nStarting array:\n\n", a)
print("\nFolded array:\n\n", foldfft(a))
Out[1]:
Shape of a = (2, 3, 3)
Starting array:
[[[0 0 0]
[1 1 1]
[2 2 2]]
[[0 1 2]
[0 1 2]
[0 1 2]]]
Folded array:
[[[2 0 1]
[2 0 1]
[2 0 1]]
[[2 2 2]
[0 0 0]
[1 1 1]]]
You could use numpy.ndarray.flat, which allows you to linearly iterate over a n dimensional numpy array. Your code should then look something like this:
b = np.asarray(x)
for i in range(len(x.flat)):
    b.flat[i] = operation(x.flat[i])
The folks above provided multiple appropriate solutions. For completeness, here is my final solution. In this toy example for the case of 3 dimensions, the function 'ops' replaces the first and last element of a vector with 1.
import numpy as np

def ops(s):
    s[0] = 1
    s[-1] = 1
    return s

a = np.random.rand(4, 4, 3)
print('------')
print('Array a')
print(a)
print('------')
for ii in np.arange(a.ndim):
    a = np.apply_along_axis(ops, ii, a)
    print('------')
    print(' Axis', str(ii))
    print(a)
    print('------')
    print(' ')
The resulting 3D array has a 1 in every element on the 'border' with the numbers in the middle of the array unchanged. This is of course a toy example; however ops could be any arbitrary function that operates on a 1D vector.
Flattening the vector will also work; I chose not to pursue that simply because the book-keeping is more difficult and apply_along_axis is the simplest approach.
apply_along_axis reference page

Converting bsxfun with @times to numpy

This is the code I have in Octave:
sum(bsxfun(@times, X*Y, X), 2)
The bsxfun part of the code produces element-wise multiplication so I thought that numpy.multiply(X*Y, X) would do the trick but I got an exception. When I did a bit of research I found that element-wise multiplication won't work on Python arrays (specifically if X and Y are of type "numpy.ndarray"). So I was wondering if anyone can explain this a bit more -- i.e. would type casting to a different type of object work? The Octave code works so I know I don't have a linear algebra mistake. I'm assuming that bsxfun and numpy.multiply are not actually equivalent but I'm not sure why so any explanations would be great.
I was able to find a website that gives Octave to Matlab function conversions, but it didn't seem to be of help in my case.
bsxfun in Matlab stands for binary singleton expansion; in numpy it's called broadcasting and should happen automatically. The solution will depend on the dimensions of your X, i.e. whether it is a row or column vector, but this answer shows one way to do it:
How to multiply numpy 2D array with numpy 1D array?
I think the issue here is that broadcasting requires one of the dimensions to be 1 and that, unlike Matlab, numpy differentiates between a 1-dimensional 2-element vector and a 2-dimensional 2x1 array, i.e. between shape (2,) and shape (2, 1); you need the latter for broadcasting to happen.
For those who don't know Numpy, I think it's worth pointing out that the equivalent of Octave's (and Matlab's) * operator (matrix multiplication) is numpy.dot (and, debatably, numpy.outer). Numpy's * operator is similar to bsxfun(@times,...) in Octave, which is itself a generalization of .*.
In Octave, when applying bsxfun, there are implicit singleton dimensions to the right of the "true" size of the operands; that is, an n1 x n2 x n3 array can be considered as n1 x n2 x n3 x 1 x 1 x 1 x .... In Numpy, the implicit singleton dimensions are to the left, so an m1 x m2 x m3 array can be considered as ... x 1 x 1 x m1 x m2 x m3. This matters when considering operand sizes: in Octave, bsxfun(@times, a, b) will work if a is 2 x 3 x 4 and b is 2 x 3. In Numpy one could not multiply two such arrays, but one could multiply a 2 x 3 x 4 and a 3 x 4 array.
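A quick illustration of that alignment rule (a sketch with made-up shapes):
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((3, 4))
a * b                   # works: (3, 4) aligns with the trailing dimensions of (2, 3, 4)
# a * np.ones((2, 3))   # ValueError: operands could not be broadcast together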
Finally, bsxfun(@times, X*Y, X) in Octave will probably look something like numpy.dot(X,Y) * X. There are still some gotchas: for instance, if you're expecting an outer product (that is, in Octave X is a column vector, Y a row vector), you could look at using numpy.outer instead, or be careful about the shape of X and Y.
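Putting that together for the expression in the question, a sketch assuming X is (n, m) and Y is (m, m), so that X*Y has the same shape as X:
import numpy as np

X = np.arange(6.0).reshape(2, 3)
Y = np.eye(3)
# Octave: sum(bsxfun(@times, X*Y, X), 2)
result = np.sum(np.dot(X, Y) * X, axis=1)
# With Y = eye(3) this is just the row-wise sum of squares of X: array([ 5., 50.])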
A little late, but I would like to provide an example of equivalent bsxfun and repmat usage in Python. This is a bit of code I was converting from Matlab to Python/numpy:
Matlab:
x =
    -2
    -1
     0
     1
     2

n =
     2

M = repmat(x,1,n+1)

M =
    -2    -2    -2
    -1    -1    -1
     0     0     0
     1     1     1
     2     2     2

M = bsxfun(@power,M,0:n)

M =
     1    -2     4
     1    -1     1
     1     0     0
     1     1     1
     1     2     4
Equivalent in Python:
In [8]: x
Out[8]:
array([[-2],
[-1],
[ 0],
[ 1],
[ 2]])
In [9]: n=2
In [11]: M = np.tile(x, (1, n + 1))
In [12]: M
Out[12]:
array([[-2, -2, -2],
[-1, -1, -1],
[ 0, 0, 0],
[ 1, 1, 1],
[ 2, 2, 2]])
In [13]: M = np.apply_along_axis(pow, 1, M, range(n + 1))
In [14]: M
Out[14]:
array([[ 1, -2, 4],
[ 1, -1, 1],
[ 1, 0, 0],
[ 1, 1, 1],
[ 1, 2, 4]])
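For comparison, since broadcasting is numpy's counterpart of bsxfun, the same table also comes from a single broadcasted power, without tile or apply_along_axis (a small sketch):
import numpy as np

x = np.array([[-2], [-1], [0], [1], [2]])   # shape (5, 1)
n = 2
x ** np.arange(n + 1)                       # (5, 1) against (3,) -> shape (5, 3)
# array([[ 1, -2,  4],
#        [ 1, -1,  1],
#        [ 1,  0,  0],
#        [ 1,  1,  1],
#        [ 1,  2,  4]])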
