scipy sparse matrix -- accessing multiple elements of a path - python

I have a scipy sparse matrix A and a (long) list of coordinates
myrows = [i1, i2, ...], mycols = [j1, j2, ...]. I need the list of their values [A[i1,j1], A[i2,j2], ...]. How can I do this quickly? A loop is too slow.
I've thought about cython.inline() (which I use elsewhere in my code) or weave, but I don't see how to use the sparse type efficiently in Cython or C++. Am I missing something simple?
Currently I'm using a hack that seems inefficient and possibly wrong sometimes -- which I flag with an error message. Here is my badly written code. Note that it relies on the ordering of elements being preserved under addition and assumes that all of the (myrows, mycols) coordinates are present in A.
import numpy as np
import scipy.sparse as sps

def getmatvals(A, myrows, mycols):  # A is a coo_matrix
    # B holds the 1-based position of each stored element of A as its value
    B = sps.coo_matrix((np.arange(1, A.nnz + 1), (A.row, A.col)), shape=A.shape)
    # T puts the sentinel value A.nnz + 1 at every queried coordinate
    T = sps.coo_matrix(([A.nnz + 1] * len(myrows), (myrows, mycols)), shape=A.shape)
    G = B - T  # queried elements become negative in G; all others stay positive
    H = np.minimum(0, G.data)  # zero out the non-queried elements
    H = H[np.nonzero(H)]
    H = H + A.nnz  # shift back to 0-based indices into A.data
    return A.data[H]
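For comparison, CSR fancy indexing already does this lookup in vectorized form. A minimal sketch (it returns a 1xN np.matrix, hence the np.asarray(...).ravel(); unlike the hack above, coordinates not stored in A simply come back as 0 rather than needing an error flag):

import numpy as np

def getmatvals_csr(A, myrows, mycols):
    # paired index arrays select one element per (row, col) pair
    Acsr = A.tocsr()
    return np.asarray(Acsr[myrows, mycols]).ravel()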

Related

Is it possible to translate this Python code to Cython?

I'm actually looking to speed up part #2 of this code as much as possible, so I thought it might be useful to try Cython. However, I'm not sure how to work with a sparse matrix in Cython. Can somebody show how to / whether it's possible to wrap it in Cython, or perhaps Julia, to make it faster?
#1) This part fills the u_dict dictionary with unique strings and then enumerates them.
import scipy.sparse as sp
import numpy as np
from scipy.sparse import csr_matrix

# collect every unique string appearing in the four dataframes
full_dict = set(train1.values.ravel().tolist() + test1.values.ravel().tolist() +
                train2.values.ravel().tolist() + test2.values.ravel().tolist())
print(len(full_dict))
u_dict = dict()
for i, q in enumerate(full_dict):
    u_dict[q] = i
shape = (len(full_dict), len(full_dict))
H = sp.lil_matrix(shape, dtype=np.int8)

def load_sparse_csr(filename):
    # rebuild a CSR matrix from its saved component arrays
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
#2) I need to speed up this part
# train_full is a pandas dataframe with two columns, w1 and w2, filled with strings
H = load_sparse_csr('matrix.npz')
correlation_train = []
for idx, row in train_full.iterrows():
    if idx % 1000 == 0:
        print(idx)
    id_1 = u_dict[row['w1']]
    id_2 = u_dict[row['w2']]
    a_vec = H[id_1].toarray()  # these vectors are of length < 3 million
    b_vec = H[id_2].toarray()
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0][1])
While I contributed to How to properly pass a scipy.sparse CSR matrix to a cython function? quite some time ago, I doubt cython is the way to go here, especially if you don't already have experience with numpy and cython. cython gives the biggest speedup when you replace iterative calculations with code that it can translate to C without calling numpy or other python code. Throw pandas into the mix and you have an even bigger learning curve.
And the important parts of scipy's sparse code are already written in cython.
Without touching the cython issue I see a couple of problems.
H is defined twice:
H = sp.lil_matrix(shape, dtype=np.int8)
H = load_sparse_csr('matrix.npz')
That's either an oversight, or a failure to understand how Python variables are created and assigned. The second assignment replaces the first, so the first does nothing. In addition, the first just makes an empty lil matrix. Such a matrix could be filled iteratively; while not fast, that is the intended use of the lil format.
The second statement creates a new matrix from data saved in an npz file. That involves the numpy npz file loader as well as the basic csr matrix creation code. And since the attributes are already in csr format, there's nothing for cython to touch.
You do have an iteration here - but over a Pandas dataframe:
for idx, row in train_full.iterrows():
    id_1 = u_dict[row['w1']]
    a_vec = H[id_1].toarray()
It looks like you are picking a particular row of H based on a dictionary lookup. Sparse matrix indexing is slow compared to dense matrix indexing. That is, if Ha = H.toarray() fits in your memory, then
a_vec = Ha[id_1, :]
will be a lot faster.
Faster selection of rows (or columns) from a sparse matrix has been asked about before. If you could work directly with the sparse data of a row, I could recommend something more direct. But you want a dense array that you can pass to np.corrcoef, so we'd have to implement the toarray step as well.
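Concretely, a minimal sketch of that rewrite, reusing H, u_dict and train_full from the question (worthwhile only when the dense Ha actually fits in memory):

Ha = H.toarray()  # one-time sparse -> dense conversion
correlation_train = []
for idx, row in train_full.iterrows():
    id_1 = u_dict[row['w1']]
    id_2 = u_dict[row['w2']]
    # plain ndarray row indexing is far cheaper than sparse row selection + toarray
    correlation_train.append(np.corrcoef(Ha[id_1, :], Ha[id_2, :])[0, 1])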
How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?

Substitute elements of a matrix at specific coordinates in python

I am trying to solve a "very simple" problem -- not so simple in Python. Given a large matrix A and another smaller matrix B, I want to substitute certain elements of A with B.
In Matlab it would look like this:
Given A, row_coord = [1,5,6], col_coord = [2,4], and a matrix B of size 3x2: A[row_coord, col_coord] = B
In Python I tried to use product(row_coord, col_coord) from itertools to generate the set of all indexes that need to be accessed in A, but it does not work. All examples of submatrix substitution refer to block-wise row_coord = col_coord examples. Nothing concrete except http://comments.gmane.org/gmane.comp.python.numeric.general/11912 seems to relate to the problem I am facing, and the code in the link does not work.
Note: I know that I can implement what I need via the double for-loop, but on my data such a loop adds 9 secs to the run of one iteration and I am looking for a faster way to implement this.
Any help will be greatly appreciated.
Assuming you're using numpy arrays, then (in the case where your B is a scalar) the following code should work to assign the chosen elements the value of B.
itertools.product will create all of the coordinate pairs, which we then convert into a numpy array and use to index your original array:
import numpy as np
from itertools import product

A = np.zeros([20, 20])
col_coord = [0, 1, 3]
row_coord = [1, 2]
# every (row, col) pair in the cross product of the two coordinate lists
coords = np.array(list(product(row_coord, col_coord)))
B = 1
A[coords[:, 0], coords[:, 1]] = B
I used this excellent answer by unutbu to work out how to do the indexing.
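For the matrix-valued B asked about in the question, numpy's np.ix_ builds the open-mesh index that assigns B block-wise across the row/column cross product. A short sketch with made-up values:

import numpy as np

A = np.zeros((20, 20))
row_coord = [1, 5, 6]
col_coord = [2, 4]
B = np.arange(6.0).reshape(3, 2)     # shape must be (len(row_coord), len(col_coord))
A[np.ix_(row_coord, col_coord)] = B  # A[1,2] = B[0,0], A[1,4] = B[0,1], etc.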

Cython function with variable sized matrix input

I am trying to convert part of a native python function to cython to improve the compute time. I would like to write a cython function just for the loop component that is taking up the time (as ipython lprun kindly told me). However, this function takes in variably sized matrices, and I can't see how to bring that across easily to statically typed cython.
for index1 in range(0, num_products):
    for index2 in range(0, num_products):
        cond_prob = (data[index1] * data[index2]).sum() / max(col_sums[index1], col_sums[index2])
        prox[index1][index2] = cond_prob
The issue is that num_products changes year to year, so the size of the matrix (data) is variable.
What is the best strategy here?
Should I write two C functions: one to create a matrix of a certain dimension using malloc, and one to do the loops over the created matrix?
Or is there some fancy cython/numpy wizardry to help in this scenario? Can I write a C function that takes in a variably sized numpy array in memory and pass the size?
Cython code is (strategically) statically typed, but that doesn't mean that arrays must have a fixed size. In straight C, passing a multidimensional array to a function can be a little awkward, but in Cython you should be able to do something like the following:
Note I took the function and variable names from your follow-up question.
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.cdivision(True)
def cooccurance_probability_cy(double[:,:] X):
    cdef int P, N, i, j, k
    P = X.shape[0]
    N = X.shape[1]
    cdef double item
    cdef double[:] CS = np.sum(X, axis=1)
    cdef double[:,:] D = np.empty((P, P), dtype=np.float64)
    for i in range(P):
        for j in range(P):
            item = 0
            for k in range(N):  # dot product of rows i and j
                item += X[i, k] * X[j, k]
            D[i, j] = item / max(CS[i], CS[j])
    return D
On the other hand, using just numpy should also be quite fast for this problem, if you use the right functions and some broadcasting. In fact, as the calculation is dominated by the matrix multiplication, I found the following to be much faster than the Cython code above (np.inner uses a highly optimized BLAS routine):
def new(X):
    CS = np.sum(X, axis=1, keepdims=True)
    D = np.inner(X, X) / np.maximum(CS, CS.T)
    return D
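As a quick sanity check, the two versions can be compared on random input; this sketch assumes the Cython function has been compiled and imported, and uses np.asarray to turn its memoryview return value back into an ndarray:

X = np.random.random((300, 120))
D1 = np.asarray(cooccurance_probability_cy(X))  # memoryview -> ndarray
D2 = new(X)
print(np.allclose(D1, D2))  # expect True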
Have you tried getting rid of the for loops with numpy?
For the first part of your equation you could, for example, try:
(data[np.newaxis, :] * data[:, np.newaxis]).sum(2)
If memory is an issue you can also use the np.einsum() function.
For the second part one could probably also cook up a numpy expression (a bit more difficult) if you've not already tried that; see the sketch below.
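For completeness, a sketch of what the fully vectorized version could look like, reusing data and col_sums from the question (assuming col_sums is a 1-D numpy array aligned with the rows of data):

num = np.einsum('ik,jk->ij', data, data)                # all pairwise row dot products
den = np.maximum(col_sums[:, None], col_sums[None, :])  # pairwise max of the sums
prox = num / den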

Python/Numpy: Build 2D array without adding duplicate rows (for triangular mesh)

I'm working on some code that manipulates 3D triangular meshes. Once I have imported mesh data, I need to "unify" vertices that are at the same point in space.
I've been assuming that numpy arrays would be the fastest way of storing & manipulating the data, but I can't seem to find a fast way of building a list of vertices while avoiding adding duplicate entries.
So, to test out methods, create a 30000x3 array with 10000 unique rows:
import numpy as np
points = np.random.random((10000,3))
raw_data = np.concatenate((points,points,points))
np.random.shuffle(raw_data)
This serves as a good approximation of mesh data, with each point appearing as a facet vertex 3 times. While unifying, I need to build a list of unique vertices; if a point is already in the list, a reference to it must be stored.
The best I've been able to come up with using numpy so far has been the following:
def unify(raw_data):
    # first point must be new
    unified_verts = np.zeros((1, 3), dtype=np.float64)
    unified_verts[0] = raw_data[0]
    ref_list = [0]
    for i in range(1, len(raw_data)):
        point = raw_data[i]
        index_array = np.where(np.all(point == unified_verts, axis=1))[0]
        # point not in array yet
        if len(index_array) == 0:
            point = np.expand_dims(point, 0)
            unified_verts = np.concatenate((unified_verts, point))
            ref_list.append(len(unified_verts) - 1)
        # point already exists
        else:
            ref_list.append(index_array[0])
    return unified_verts, ref_list
Testing using cProfile:
import cProfile
cProfile.run("unify(raw_data)")
On my machine this runs in 5.275 seconds. I've thought about using Cython to speed it up, but from what I've read, Cython doesn't typically run much faster than numpy methods. Any advice on ways to do this more efficiently?
Jaime has shown a neat trick which can be used to view a 2D array as a 1D array with items that correspond to rows of the 2D array. This trick can allow you to apply numpy functions which take 1D arrays as input (such as np.unique) to higher dimensional arrays.
If the order of the rows in unified_verts does not matter (as long as the ref_list is correct with respect to unified_verts), then you could use np.unique along with Jaime's trick like this:
def unify2(raw_data):
    # view each row as a single opaque void item so np.unique can compare whole rows
    dtype = np.dtype((np.void, raw_data.shape[1] * raw_data.dtype.itemsize))
    uniq, inv = np.unique(raw_data.view(dtype), return_inverse=True)
    uniq = uniq.view(raw_data.dtype).reshape(-1, raw_data.shape[1])
    return uniq, inv
The result is the same in the sense that the raw_data can be reconstructed from the return values of unify (or unify2):
unified, ref = unify(raw_data)
uniq, inv = unify2(raw_data)
assert np.allclose(uniq[inv], unified[ref]) # raw_data
On my machine, unified, ref = unify(raw_data) requires about 51.390s, while uniq, inv = unify2(raw_data) requires about 0.133s (~ 386x speedup).
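(On newer numpy, 1.13 and later, the void-view trick is no longer needed, since np.unique accepts an axis argument directly. A sketch of the equivalent call:)

def unify3(raw_data):
    # axis=0 treats each row as one item; inv plays the role of ref_list above
    uniq, inv = np.unique(raw_data, axis=0, return_inverse=True)
    return uniq, inv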

In SciPy, fancy indexing for csr_matrices

I am new to Python, so forgive me ahead of time if this is an elementary question, but I have searched around and have not found a satisfying answer.
I am trying to do the following using NumPy and SciPy:
import numpy as np
from scipy import sparse

I, J = x[:, 0], x[:, 1]  # x is a two-column array of (r, c) pairs
V = np.ones(len(I))
G = sparse.coo_matrix((V, (I, J)))  # G's dimensions are 1032570x1032570
G = G + G.transpose()
r, c = G.nonzero()
G[r, c] = 1
...
NotImplementedError: Fancy indexing in assignment not supported for csr matrices
Pretty much, I want all the nonzero values to equal 1 after adding the transpose, but I get the fancy-indexing error message shown above.
Alternatively, if I could show that the matrix G is symmetric, adding the transpose would not be necessary.
Any insight into either approach would be very much appreciated.
In addition to doing something like G = G / G, you can operate on G.data directly.
So, in your case, doing either:
G.data = np.ones(G.nnz)
or
G.data[G.data != 0] = 1
will do what you want. This is more flexible, as it allows you to perform other types of filters (e.g. G.data[G.data > 0.9] = 1 or G.data = np.random.random(G.nnz)).
The second option will only set the values to one if they already have a nonzero value. During some calculations, you'll wind up with zero values that are "dense" (i.e. they're actually stored as a value in the sparse array). You can remove these in-place with G.eliminate_zeros().
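Putting the pieces together for the goal stated in the question (symmetrize, then set every stored value to 1), a minimal sketch:

G = G + G.transpose()  # sparse addition returns a CSR matrix here
G.eliminate_zeros()    # drop any explicitly stored zeros first
G.data[:] = 1          # binarize all remaining stored values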
