Why accessing a sparse matrix is costly?

Why accessing a sparse matrix is costly? - python

I have a 1034_by_1034 sparse matrix (scipy.sparse.csr.csr_matrix), which basically represents the adjacency matrix of a graph. I want to check if some elements are ones or not. But I found this to be a very slow operation. Before the if statement the code runs in 11 seconds, but when I enable the if check, it takes 40 seconds!
Here's my code snippet:
target = list()
for edge_id in edges_ids:
v1_label, v2_label = from_edgeID_to_vertix_labels(edge_id) #fast
v1_index = g.get_v_index(v1_label) #fast
v2_index = g.get_v_index(v2_label) #fast
#if the following chunk is enabled, it becomes slow!
if A[v1_index, v2_index] == 1:
target.append(1)
else:
target.append(0)
g.target = target

The reason is quite likely to be the fact that fetching a single value from a sparse matrix in CSR (or CSC form), given indices (i, j), is very expensive. Algorithms for these sparse matrix representations aren't usually designed to do that: they're designed to use the indices they find as they go through the arrays sequentially.
In CSR, when you look up a row, you effectively get an array of column indices and the corresponding values. If you're fetching a single value, you have to do a linear search through the little array of column indices (unsorted in general) to see if it's there (otherwise the value is zero); if found, you then pick the value out of the value array and return it. It might look a bit like this ad-hoc C (this is intended to be illustrative):
/* Obviously silly CSR matrix typedef */
typedef struct sparse_s {
int row[nnz+1];
int col[nnz];
double value[nnz];
} sparse_s;
double spGetValue(sparse_s const* s, int i, int j)
{
int k;
for(k=s->row[i]; k<s->row[i+1]; k++) {
if( j == s->col[k] ) {
return s->value[k];
}
}
return 0.0;
}
So, if you were to average 10 elements on every row, you have to search through a ten element array for every access. This is much less of a problem for algorithms like SpMV that use the column indices as they find them. If you implemented SpMV like dense MM, fetching every value, it would be horribly horribly slow even if you had some oracular magic way of skipping the zeros. If you think that's bad, inserting an element into a CSR/CSC matrix is so viciously expensive that it's (almost) never done.
In short, you might get better results by either reorganizing your code so that you're iterating over the three vectors of the CSR matrix directly or using a different sparse matrix representation for this particular problem.
It might well be something more “Pythoney”, but I wouldn't expect your code to perform well even in a best-case scenario in C if the matrix representation and access method were retained.

In this case, you may be better off using a nested defaultdict:
from collections import defaultdict
A = defaultdict(lambda : defaultdict(int))
# Example of how to set an element in the adjacency matrix:
A[1][2] = 1
However, that does not support any of the matrix manipulations offered by numpy or scipy, but it should be fast for that particular use case.

Related

How is numpy.einsum implemented?

I want to understand how is einsum function in python implemented. I found the source code in numpy/core/src/multiarray/einsum.c.src file but couldn't completely understand it. In particular I want to understand how does it creates the required loops automatically?
For example:
import numpy as np
a = np.random.rand(2,3,4,5)
b = np.random.rand(5,3,2,4)
ll = np.einsum('ijkl, ljik ->', a,b) # This should loop over all the
# four indicies i,j,k,l. How does it create loops for these indices automatically ?
# The assume that under the hood it does the following
sum1 = 0
for i in range(2):
for j in range(3):
for k in range(4):
for l in range(5):
sum1 = sum1 + a[i,j,k,l]*b[l,j,i,k]
Thank you in advance
ps: This question is not about how to use numpy.einsum

I want to understand how does it creates the required loops automatically?
Well, it does not create the loops the way you think it does. In this case, it creates an iterator operating over multiple arrays and then use it in a generic main loop. In the more general case, there are two main loops: one to iterate over the output array items and one to perform a reduction.
The main function is PyArray_EinsteinSum. In your case, it takes an unoptimized path and end up creating a basic iteration function based on the iterator created previously (ie. iter). This function is get_sum_of_products_function. It basically analyze the einsum operation so to find the best (sum of product) function to call based on a lookup table (like _outstride0_specialized_table). In your specific case, double_sum_of_products_outstride0_two is called. Numpy use a template system so to generate this function automatically at build time (*.c.src files are template files converted to *.c files based on predefined basic comments). In this case, the function is generated from #name#_sum_of_products_outstride0_#noplabel# and once computed by the C preprocessor it gives something like the following function:
static void double_sum_of_products_outstride0_two(int nop,
char **dataptr,
npy_intp const *strides,
npy_intp count)
{
npy_double accum = 0;
char *data0 = dataptr[0];
npy_intp stride0 = strides[0];
char *data1 = dataptr[1];
npy_intp stride1 = strides[1];
while (count--)
{
accum += (*(npy_double *)data0) * (*(npy_double *)data1);
data0 += stride0;
data1 += stride1;
}
*((npy_double *)dataptr[2]) = (accum + (*((npy_double *)dataptr[2])));
}
As you can see, there is only one main loop iterating over the previously generated iterator. In your case, stride0 and stride1 are both equal to 8, data0 and data1 are the raw input arrays, dataptr is the raw output array and count is set to 120 initially. Note that the fact both strides are equal to 8 is surprising at first glance since the einsum does not iterate on the two array contiguously. This is because the second array is copied and reorder because Numpy cannot create a uniform view based on the einsum parameters.
Note that the fallback case use for the example code is not particularly optimized and it only produce one value. For example, the much more optimized double_sum_of_products_contig_contig_outstride0_two function can be called from unbuffered_loop_nop2_ndim2 for the following code:
import numpy as np
a = np.random.rand(3, 10)
b = np.random.rand(3, 10)
for i in range(1):
ll = np.einsum('ij, ij -> i', a, b)
In this case, the double_sum_of_products_contig_contig_outstride0_two performs the reductions for a given output item and unbuffered_loop_nop2_ndim2 iterate over the output array.
If the expression ij, ij -> j is instead used in the above code, then the function double_sum_of_products_contig_two is called which operates the same way than double_sum_of_products_contig_contig_outstride0_two except it reads/writes on the whole output line during the reduction.

Ordering a two-dimensional array relative to the main diagonal

Given a two-dimensional array T of size NxN, filled with various natural numbers (They do not have to be sorted in any way as in the example below.). My task is to write a program that transforms the array in such a way that all elements lying above the main diagonal are larger than each element lying on the diagonal and all elements lying below the main diagonal are to be smaller than each element on the diagonal.
For example:
T looks like this:
[2,3,5][7,11,13][17,19,23] and one of the possible solutions is:
[13,19,23][3,7,17][5,2,11]
I have no clue how to do this. Would anyone have an idea what algorithm should be used here?

Let's say the matrix is NxN.
Put all N² values inside an array.
Sort the array with whatever method you prefer (ascending order).
In your final array, the (N²-N)/2 first values go below the diagonal, the following N values go to the diagonal, and the final (N²-N)/2 values go above the diagonal.
The following pseudo-code should do the job:
mat <- array[N][N] // To be initialized.
vec <- array[N*N]
for i : 0 to (N-1)
for j : 0 to (N-1)
vec[i*N+j]=mat[i][j]
next j
next i
sort(vec)
p_below <- 0
p_diag <- (N*N-N)/2
p_above <- (N*N+N)/2
for i : 0 to (N-1)
for j : 0 to (N-1)
if (i>j)
mat[i][j] = vec[p_above]
p_above <- p_above + 1
endif
if (i<j)
mat[i][j] = vec[p_below]
p_below <- p_below + 1
endif
if (i=j)
mat[i][j] = vec[p_diag]
p_diag <- p_diag + 1
endif
next j
next i
Code can be heavily optimized by sorting directly the matrix, using a (quite complex) custom sort operator, so it can be sorted "in place". Technically, you'll do a bijection between the matrix indices to a partitioned set of indices representing "below diagonal", "diagonal" and "above diagonal" indices.
But I'm unsure that it can be considered as an algorithm in itself, because it will be highly dependent on the language used AND on how you stored, internally, your matrix (and how iterators/indices are used). I could write one in C++, but I lack knownledge to give you such an operator in Python.
Obviously, if you can't use a standard sorting function (because it can't work on anything else but an array), then you can write your own with the tricky comparison builtin the algorithm.
For such small matrixes, even a bubble-sort can work properly, but obviously implementing at least a quicksort would be better.
Elements about optimizing:
First, we speak about the trivial bijection from matrix coordinate [x][y] to [i]: i=x+y*N. The invert is obviously x=floor(i/N) & y=i mod N. Then, you can parse the matrix as a vector.
This is already what I do in the first part initializing vec, BTW.
With matrix coordinates, it's easy:
Diagonal is all cells where x=y.
The "below" partition is everywhere x<y.
The "above" partition is everywhere x>y.
Look at coordinates in the below 3x3 matrix, it's quite evident when you know it.
0,0 1,0 2,0
0,1 1,1 2,1
0,2 1,2 2,2
We already know that the ordered vector will be composed of three parts: first the "below" partition, then the "diagonal" partition, then the "above" partition.
The next bijection is way more tricky, since it requires either a piecewise linear function OR a look-up table. The first requires no additional memory but will use more CPU power, the second use as much memory as the matrix but will require less CPU power.
As always, optimization for speed often cost memory. If memory is scarse because you use huge matrixes, then you'll prefer a function.
In order to shorten a bit, I'll explain only for "below" partition. In the vector, the (N-1) first elements will be the ones belonging to the first column. Then, we'll have (N-2) elements for the 2nd column, (N-3) for the third, until we had only 1 element for the (N-1)th column. You see the scheme, sum of the number of elements and the column (zero-based index) is always (N-1).
I won't write the function, because it's quite complex and, honestly, it won't help so much to understand. Simply know that converting from matrix indices to vector is "quite easy".
The opposite is more tricky and CPU-intensive, and it SHOULD use a (N-1) element vector to store where each column starts within the vector to GREATLY speed up the process. Thanks, this vector can also be used (from end to begin) for the "above" partition, so it won't burn too much memory.
Now, you can sort your "vector" normally, simply by chaining the two bijection together with the vector index, and you'll get a matrix cell instead. As long as the sorting algorithm is stable (that's usually the case), it will works and will sort your matrix "in place", at the expense of a lot of mathematical computing to "route" the linear indexes to matrix indexes.
Please note that, despite we speak about bijections, we need ONLY the "vector to matrix" formulas. The "matrix to vector" are important - it MUST be a bijection! - but you won't use them, since you'll sort directly the (virtual) vector from 0 to N²-1.

Is there an equivalent of Python's list and append feature in Matlab?

This is more of a Matlab programming question than it is a math question.
I'd like to run gradient descent multiple on different learning rates. I have a set of learning rates
alpha = [0.3, 0.1, 0.03, 0.01, 0.003, 0.001];
and each time I run gradient descent, I get a vector J_vals as output. However, I don't know Matlab well enough to know how to implement this besides doing something like:
[theta, J_vals] = gradientDescent(...., alpha(1),...);
J1 = J_vals;
[theta, J_vals] = gradientDescent(...., alpha(2),...);
J2 = J_vals;
and so on.
I thought about using a for loop, but then I don't know how I would deal with the J_vals's (not sure how to apply the for loop to J1, J2, and so on). Perhaps it would look something like this:
for i = len(alpha)
[theta, J_vals] = gradientDescent(..., alpha(i),...);
J(i) = J_vals;
end
Then I would have a vector of vectors.
In Python, I would just run a for loop and append each new result to the end of a list. How do I implement something like this in Matlab? Or is there a more efficient way?

If you know how many loops you are going have and the size of the J_vals (or at least a reasonable upper bound) I would suggest pre-allocating the size of the container array
J = zeros(n,1);
then on each loop insert the new values
J(start:start+n) = J_vals
That way you don't reallocate memory. If you don't know, you can append the values to the array. For example,
J = []; % initialize
for i = len(alpha)
[theta, J_vals] = gradientDescent(..., alpha(i),...);
J = [J; J_vals]; % Append column row
end
but this is re-allocating the size of the array every loop. If it's not too many loops then it should be ok.

Matlab's "cell arrays" are kind of like lists in Python. They are similar in that you can put variable datatypes into them. Nobody seems to be too sure, but most likely the cell array is implemented as an array of object pointers. That means that it is still somewhat expensive to append to it (cell_array{length(cell_array) + 1} = new_data), but at least you are only appending a pointer instead of the entire column. You would still have to convert the cell array to a normal matrix afterward using cell2mat.
The most idiomatic Matlab solution is to pre-allocate (as #dpmcmlxxvi suggested).
I think what you are describing is a really common use case, and it's unfortunate that Matlab requires such a verbose idiom for this. Also it's frustrating that the documentation is opaque on how cell arrays are implemented and whether it is expensive to append to a cell array.

Your solution works just fine as long as you add a : for the row subscript (assuming J_vals is a column vector):
for i = len(alpha)
[theta, J_vals] = gradientDescent(..., alpha(i),...);
J(:, i) = J_vals;
%// ^... all rows, column 'i'
end
You could even put that as the return value:
for i = len(alpha)
[theta, J(:, i)] = gradientDescent(..., alpha(i),...);
%// ^... add returned value directly to our list
end
Both of these methods allow you to preallocate your matrix for a potential speed gain.
If you want to build your list as you go, you can use the method in #dpmcmlxxvi's answer, or you can use the special subscript end. Neither of these methods are compatible with preallocation, though.
for i = len(alpha)
[theta, J(:, end+1)] = gradientDescent(..., alpha(i),...);
%// ^... add new vector after the current end of list
end
I would also like to suggest you not use i as a variable name in Matlab. I know it's natural for other languages, but in Matlab it overwrites the built-in imaginary constant i.
See: https://stackoverflow.com/a/14790765/1377097

Indices of scipy sparse csr_matrix

I have two scipy sparse csr matrices with the exact same shape but potentially different data values and nnz value. I now want to get the top 10 elements of one matrix and increase the value on the same indices on the other matrix. My current approach is as follows:
idx = a.data.argpartition(-10)[-10:]
i, j = matrix.nonzero()
i_idx = i[idx]
j_idx = j[idx]
b[i_idx, j_idx] += 1
The reason I have to go this way is that a.data and b.data do not necessarily have the same number of elements and hence the indices would differ.
My question now is whether I can improve this in some way. As far as I know the nonzero procedure is not elegant as I have to allocate two new arrays and I am very tough on memory already. I can get the j_indices via csr_matrix.indices but what about the i_indices? Can I use the indptr in a nice way for that?
Happy for any hints.

I'm not sure what the "top 10 elements" means. I assume that if you have matrices A and B you want to set B[i, j] += 1 if A[i, j] is within the first 10 nonzero entries of A (in CSR format).
I also assume that B[i, j] could be zero, which is the worst case performance wise, since you need to modify the sparsity structure of your matrix.
CSR is not an ideal format to use for changing sparsity structure. This is because every new insertion/deletion is O(nnzs) complexity (assuming the CSR storage is backed by an array - and it usually is).
You could use the DOK format for your second matrix (the one you are modifying), which provides O(1) access to elements.
I haven't benchmarked this but I suppose your option is 10 * O(nnzs) at worst, when you are adding 10 new nonzero values, whereas the DOK version should need O(nnzs) to build the matrix, then O(1) for each insertion and, finally, O(nnzs) to convert it back to CSR (assuming this is needed).

Large matrix multiplication in Python - what is the best option?

I have two boolean sparse square matrices of c. 80,000 x 80,000 generated from 12BM of data (and am likely to have orders of magnitude larger matrices when I use GBs of data).
I want to multiply them (which produces a triangular matrix - however I dont get this since I don't limit the dot product to yield a triangular matrix).
I am wondering what the best way of multiplying them is (memory-wise and speed-wise) - I am going to do the computation on a m2.4xlarge AWS instance which has >60GB of RAM. I would prefer to keep the calc in RAM for speed reasons.
I appreciate that SciPy has sparse matrices and so does h5py, but have no experience in either.
Whats the best option to go for?
Thanks in advance
UPDATE: sparsity of the boolean matrices is <0.6%

If your matrices are relatively empty it might be worthwhile encoding them as a data structure of the non-False values. Say a list of tuples describing the location of the non-False values. Or a dictionary with the tuples as the keys.
If you use e.g. a list of tuples you could use a list comprehension to find the items in the second list that can be multiplied with an element from the first list.
a = [(0,0), (3,7), (5,2)] # et cetera
b = ... # idem
for r, c in a:
res = [(r, k) for j, k in b if k == j]

-- EDITED TO SATISFY BELOW COMMENT / DOWNVOTER --
You're asking how to multiply matrices fast and easy.
SOLUTION 1: This is a solved problem: use numpy. All these operations are easy in numpy, and since they are implemented in C, are rather blazingly fast.
http://www.numpy.org/
http://www.scipy.org
also see:
Very large matrices using Python and NumPy
http://docs.scipy.org/doc/scipy/reference/sparse.html
SciPy and Numpy have sparse matrices and matrix multiplication. It doesn't use much memory since (at least if I wrote it in C) it probably uses linked lists, and thus will only use the memory required for the sum of the datapoints, plus some overhead. And, it will almost certainly be blazingly fast compared to pure python solution.
SOLUTION 2
Another answer here suggests storing values as tuples of (x, y), presuming value is False unless it exists, then it's true. Alternate to this is a numeric matrix with (x, y, value) tuples.
REGARDLESS: Multiplying these would be Nasty time-wise: find element one, decide which other array element to multiply by, then search the entire dataset for that specific tuple, and if it exists, multiply and insert the result into the result matrix.
SOLUTION 3 ( PREFERRED vs. Solution 2, IMHO )
I would prefer this because it's simpler / faster.
Represent your sparse matrix with a set of dictionaries. Matrix one is a dict with the element at (x, y) and value v being (with x1,y1, x2,y2, etc.):
matrixDictOne = { 'x1:y1' : v1, 'x2:y2': v2, ... }
matrixDictTwo = { 'x1:y1' : v1, 'x2:y2': v2, ... }
Since a Python dict lookup is O(1) (okay, not really, probably closer to log(n)), it's fast. This does not require searching the entire second matrix's data for element presence before multiplication. So, it's fast. It's easy to write the multiply and easy to understand the representations.
SOLUTION 4 (if you are a glutton for punishment)
Code this solution by using a memory-mapped file of the required size. Initialize a file with null values of the required size. Compute the offsets yourself and write to the appropriate locations in the file as you do the multiplication. Linux has a VMM which will page in and out for you with little overhead or work on your part. This is a solution for very, very large matrices that are NOT SPARSE and thus won't fit in memory.
Note this solves the complaint of the below complainer that it won't fit in memory. However, the OP did say sparse, which implies very few actual datapoints spread out in giant arrays, and Numpy / SciPy handle this natively and thus nicely (lots of people at Fermilab use Numpy / SciPy regularly, I'm confident the sparse matrix code is well tested).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.