I have one large scipy csr_matrix and want to calculate matrix.dot(matrix.T). However, I do not need every single dot product, only those selected by some rule. For example, the rule could be specified by another binary matrix with nonzero elements for those row/column pairs that should be calculated: if rules[0,10] = 1, then the dot product between row 0 and row 10 should be computed. The rules could of course also be represented by some other data structure.
A simple solution would be to loop through the rules manually, slice out the corresponding rows, and compute each dot product. This does not seem like the best solution to me, particularly since slicing is quite expensive for sparse matrices. Maybe someone has a better idea about how to approach this.
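One way to avoid slicing each pair separately (a sketch, assuming the rules are available as a sparse 0/1 matrix; the random data here is purely illustrative) is to gather all required row pairs at once and compute every dot product in a single vectorized pass:

import numpy as np
from scipy.sparse import csr_matrix, coo_matrix

rng = np.random.default_rng(0)
matrix = csr_matrix(rng.integers(0, 2, size=(100, 50)).astype(float))
rules = coo_matrix(rng.integers(0, 2, size=(100, 100)))  # 1 = compute this pair

# Fancy-index both row blocks in one shot, multiply elementwise, and
# sum along the rows: products[n] is dot(row rows[n], row cols[n]).
rows, cols = rules.row, rules.col
products = np.asarray(matrix[rows].multiply(matrix[cols]).sum(axis=1)).ravel()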
I have a matrix A = np.array([[1,1,1],[1,2,3],[4,4,4]]) and I want only the linearly independent rows in my new matrix. The answer might be A_new = np.array([[1,1,1],[1,2,3]]) or A_new = np.array([[1,2,3],[4,4,4]]).
Since I have a very large matrix, I need to decompose it into a smaller, full-rank matrix of linearly independent rows. Can someone please help?
There are many ways to do this, and which way is best will depend on your needs. And, as you noted yourself, there isn't even a unique output.
One way to do this would be to use Gram-Schmidt to find an orthogonal basis, where the first $k$ vectors in this basis have the same span as the first $k$ independent rows. If at any step you find a linear dependence, drop that row from your matrix and continue the procedure.
A simple way to do this with numpy would be
Q, R = np.linalg.qr(A.T)
and then drop any rows of A for which the diagonal entry $R_{i,i}$ is zero.
For instance, you could do
A[np.abs(np.diag(R)) >= 1e-10]
While this will work perfectly in exact arithmetic, it may not work as well in finite precision. In floating point, almost any matrix will appear numerically full rank, so you will need some kind of threshold to decide whether there is a linear dependence. If you use the built-in QR method, you will have to make sure that there is no dependence on columns which you previously dropped.
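For concreteness, here is a minimal runnable version of the QR idea on the example matrix from the question (the 1e-10 threshold is a judgment call, not a universal constant):

import numpy as np

A = np.array([[1, 1, 1],
              [1, 2, 3],
              [4, 4, 4]], dtype=float)

# Column i of A.T is row i of A; R[i, i] is (numerically) zero
# exactly when that row depends on the earlier rows.
Q, R = np.linalg.qr(A.T)
A_new = A[np.abs(np.diag(R)) >= 1e-10]
print(A_new)   # keeps [[1. 1. 1.], [1. 2. 3.]]; the dependent third row is gone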
If you need even more stability, you could iteratively solve the least squares problem
A.T[:,dependent_cols] x = A.T[:,col_to_check]
using a stable direct method. If the residual of this solve is (numerically) zero, then A.T[:,col_to_check] is dependent on the previous vectors, with the combination given by x.
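A minimal sketch of that iterative check, using np.linalg.lstsq as the stable solver (the function name and the residual tolerance are illustrative choices, not part of the original answer):

import numpy as np

def independent_rows(A, tol=1e-10):
    # Greedily keep a row only if the least squares residual against the
    # rows kept so far is nonzero, i.e. the row is not (numerically) a
    # linear combination of them.
    kept = []
    for i, row in enumerate(A):
        if kept:
            basis = A[kept].T                        # columns = kept rows
            x, *_ = np.linalg.lstsq(basis, row, rcond=None)
            if np.linalg.norm(basis @ x - row) <= tol:
                continue                             # dependent: skip it
        kept.append(i)
    return A[kept]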
Which solver to use may also be dictated by your data type.
I am trying to generate a few very large arrays, and at least one is ending up being singular, which is made obvious by this familiar error message:
File "C:\Anaconda3\lib\site-packages\numpy\linalg\linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
LinAlgError: Singular matrix
Of course I do not want my array to be singular, but I am more interested in determining WHY my array is singular. What I mean by this is that I would like to have a way to answer the following questions without manually checking each entry:
Is the array square? (I believe non-square arrays trigger a separate error message, which is convenient, but I'll include this as a singularity property anyway)
Are any rows populated only by zeros?
Are any columns populated only by zeros?
Are any rows linearly dependent on the other rows?
For relatively small arrays, the first two conditions are easily answered by visual inspection. However, because my arrays are substantially large, I do not want to have to go in and manually check each array element to see if any of those conditions are met.
I tried pulling up the linalg.py script to see how it determines that a matrix is singular, but I could not work out how it makes that determination.
I also tried searching for info online, and nothing seemed to be of help. Most topics seemed to answer only some form of the following questions/objectives: 1) "I want Python to tell me if my matrix is singular" or 2) "why is Python giving me this error message?". Because I already know that my matrix/matrices are singular, neither of these questions is of importance to me.
Again, I am not looking for an answer along the lines of, "Oh, well this particular matrix is singular because . . .". I am looking for a method I can use immediately on ANY singular matrix to determine (especially for large arrays) what is causing the singularity.
Is there a built-in Python function that does this, or is there some other relatively simple way to do this before I try to create a function that will do this for me?
Singular matrices have at least one eigenvalue equal to zero. You can create a diagonalizable singular matrix by starting from its eigenvalue decomposition:
$A = V D V^{-1}$
D is the diagonal matrix of eigenvalues. So create any invertible matrix V and a diagonal matrix D that has at least one zero on the diagonal, and then A will be singular.
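A quick sketch of that construction (the eigenvalues are arbitrary; a random V is almost surely invertible):

import numpy as np

rng = np.random.default_rng(0)
V = rng.random((4, 4))                 # almost surely invertible
D = np.diag([3.0, 1.0, 0.5, 0.0])      # one zero eigenvalue
A = V @ D @ np.linalg.inv(V)           # A = V D V^{-1} is singular
print(np.linalg.matrix_rank(A))        # 3, one short of full rank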
The traditional way of checking is by computing an SVD. This is what the function numpy.linalg.matrix_rank uses to compute the rank, and you can then check if matrix_rank(M) == M.shape[0] (assuming a square matrix).
For more information, check out this excellent answer to a similar question for Matlab users.
The rank of the matrix will tell you how many rows aren't zero or linear combinations, but not specifically which ones. It's a relatively fast operation, so it might be useful as a first-pass check.
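To pin down two of the conditions from the question directly, all-zero rows and columns are cheap to locate; a sketch, with a small M standing in for your large array:

import numpy as np

M = np.array([[1., 2., 0.],
              [0., 0., 0.],
              [2., 4., 0.]])

print(M.shape[0] == M.shape[1])        # is it square? True
print(np.where(~M.any(axis=1))[0])     # indices of all-zero rows: [1]
print(np.where(~M.any(axis=0))[0])     # indices of all-zero columns: [2]
print(np.linalg.matrix_rank(M))        # 1: row 2 is also 2 * row 0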
I need to form a 2D matrix with total size 2,886 x 2,003,817. I tried to use numpy.zeros to make a 2D zero matrix and then calculate and assign each element of the matrix (most of them are zero, so I only need to replace a few of them).
But when I try numpy.zeros to initialize my matrix I get the following memory error:
C = numpy.zeros((2886, 2003817))   # raises MemoryError
I also tried to form the matrix without initialization. Basically, I calculate the elements of each row in each iteration of my algorithm and then do
C = numpy.concatenate((C, [A]), axis=0)
in which C is my final matrix and A is the row calculated at the current iteration. But I found this method takes a lot of time; I am guessing it is because of using numpy.concatenate(?)
Could you please let me know if there is a way to avoid the memory error when initializing my matrix, or if there is any better method or suggestion for forming a matrix of this size?
Thanks,
Amir
If your data has a lot of zeros in it, you should use a scipy.sparse matrix.
It is a special data structure designed to save memory for matrices that have a lot of zeros. However, if your matrix is not that sparse, sparse formats start to take up more memory than a dense array would. There are many kinds of sparse matrices, and each of them is efficient at some operations and inefficient at others, so be careful which one you choose.
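A minimal sketch of how that might look for the shape in the question (the assigned entries are just placeholders):

from scipy.sparse import lil_matrix

C = lil_matrix((2886, 2003817))   # no dense allocation, so no MemoryError
C[0, 5] = 3.14                    # assign the few nonzero entries as
C[1, 1000000] = 2.0               # they are computed, row by row
C = C.tocsr()                     # convert for fast arithmetic afterwards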
I'm currently trying to classify text. My dataset is too big, and as suggested here, I need to use a sparse matrix. My question is now: what is the right way to add an element to a sparse matrix? Let's say, for example, I have a matrix X which is my input.
X = np.random.randint(2, size=(6, 100))
Now this matrix X looks like an ndarray of an ndarray (or something like that).
If I do
X2 = csr_matrix(X)
I have the sparse matrix, but how can I add another element to the sparse matrix?
For example, how do I convert a dense row like [1,0,0,0,1,1,1,0,...,0,1,0] to a sparse vector and append it to the sparse input matrix?
(btw, I'm very new at python, scipy,numpy,scikit ... everything)
Scikit-learn has great documentation, with great tutorials that you really should read before trying to invent it yourself. This one is the first to read: it explains how to classify text, step by step, and this one is a detailed example of text classification using a sparse representation.
Pay extra attention to the parts where they talk about sparse representations, in this section. In general, if you want to use an SVM with a linear kernel and you have a large amount of data, LinearSVC (which is based on Liblinear) is better.
Regarding your question - I'm sure there are many ways to concatenate two sparse matrices (btw this is what you should search for on Google for other ways of doing it). Here is one, but you'll have to convert from csr_matrix to coo_matrix, which is another type of sparse matrix: Is there an efficient way of concatenating scipy.sparse matrices?.
EDIT: When concatenating two matrices (or a matrix and an array, which is a one-dimensional matrix), the general idea is to concatenate X1.data and X2.data and manipulate their indices and indptrs (or row and col in the case of coo_matrix) to point to the correct places. Some sparse representations are better for specific operations and more complex for others; you should read about csr_matrix and see whether it is the best representation for you. But I really urge you to start from those tutorials I posted above.
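In practice, scipy.sparse.vstack does this index bookkeeping for you; a minimal sketch of appending one dense row to the matrix from the question:

import numpy as np
from scipy.sparse import csr_matrix, vstack

X2 = csr_matrix(np.random.randint(2, size=(6, 100)))
new_row = csr_matrix(np.random.randint(2, size=(1, 100)))  # wrap the dense row
X3 = vstack([X2, new_row])      # 7 x 100, still sparse
X3 = X3.tocsr()                 # vstack may return COO; convert back if needed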
I am working on an FEM project using Scipy. Now my problem is that the assembly of the sparse matrices is too slow. I compute the contribution of every element in small dense matrices (one for each element). For the assembly of the global matrices I loop over all the small dense matrices and set the matrix entries the following way:

i, j = someList[k][l]
Mglobal[i,j] = Mglobal[i,j] + Mlocal[k,l]

Mglobal is a lil_matrix of appropriate size; someList maps the indexing variables.

Of course this is rather slow and consumes most of the matrix assembly time. Is there a better way to assemble a large sparse matrix from many small dense matrices? I tried scipy.weave, but it doesn't seem to work with sparse matrices.
I posted my response to the scipy mailing list; Stack Overflow is a bit easier to access, so I will post it here as well, albeit a slightly improved version.

The trick is to use the IJV storage format. This is a trio of three arrays, where the first one contains row indices, the second has column indices, and the third has the values of the matrix at those locations. This is the best way to build finite element matrices (or any sparse matrix, in my opinion) as access to this format is really fast (just filling in an array).

In scipy this is called coo_matrix; the class takes the three arrays as arguments. It is really only useful for converting to another format (CSR or CSC) for fast linear algebra.
For finite elements, you can estimate the size of the three arrays with something like
size = number_of_elements * number_of_basis_functions**2
so if you have 2D quadratics you would do number_of_elements * 36, for example.
This approach is convenient because if you have the local matrices you definitely have the global numbers and entry values: exactly what you need for building the three IJV arrays. Scipy is smart enough to throw out zero entries, so overestimating is fine.
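A sketch of the whole pattern (the element data here is random filler; in a real code, dofs and Mlocal would come from your mesh and quadrature):

import numpy as np
from scipy.sparse import coo_matrix

n_elements, n_basis, n_dof = 1000, 6, 500
size = n_elements * n_basis**2          # upper bound on the entry count

I = np.empty(size, dtype=np.int32)      # row indices
J = np.empty(size, dtype=np.int32)      # column indices
V = np.empty(size)                      # values

k = 0
for e in range(n_elements):
    dofs = np.random.randint(0, n_dof, n_basis)   # global numbers
    Mlocal = np.random.rand(n_basis, n_basis)     # element matrix
    for a in range(n_basis):
        for b in range(n_basis):
            I[k], J[k], V[k] = dofs[a], dofs[b], Mlocal[a, b]
            k += 1

# Duplicate (i, j) pairs are summed during conversion, which is
# exactly the accumulation that assembly needs.
Mglobal = coo_matrix((V[:k], (I[:k], J[:k])), shape=(n_dof, n_dof)).tocsr()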