How to multiply large matrices without getting a memory error - Python

I have a matrix X with dimensions (140000, 28).
Calculating np.transpose(X) * X in Python gives a memory error.
How can I do the multiplication?
Name: X, Type: float64, Size: (140000L, 28L)... I think it is an ndarray.
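A minimal sketch of one way around this, assuming X is a plain float64 ndarray rather than np.matrix (the chunked variant and the 10000-row block size below are illustrative choices, not from the original question):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((140000, 28))  # same shape as in the question

# With @ (or np.dot) the contraction runs over the 140000-long axis,
# so the result is only a (28, 28) Gram matrix -- a few kilobytes:
G = X.T @ X

# If holding X itself in memory is the bottleneck, the same product
# can be accumulated block by block over the rows:
G_blocked = np.zeros((28, 28))
for start in range(0, X.shape[0], 10000):
    block = X[start:start + 10000]
    G_blocked += block.T @ block

assert np.allclose(G, G_blocked)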

Related

Memory error when converting matrix to sparse matrix, specified dtype is invalid

Purpose:
I want to invert the dense matrix first and then convert it to a sparse matrix, but this reports a memory error.
Description:
The original CSR sparse matrix (only 1 and 0) is:
<5910x4403279 sparse matrix of type '<class 'numpy.int64'>' with 73906823 stored elements in Compressed Sparse Row format>
I want to calculate the Tversky similarity between rows. Thus, I need to convert the sparse matrix to dense and invert the dense matrix first, and then use matrix * invert_matrix.T to calculate the relative complement.
https://en.wikipedia.org/wiki/Tversky_index
So, I invert the dense matrix after changing the dtype to "bool":
bool_mat = mat.astype(bool)              # reinterpret the 0/1 data as booleans
invert_arr = ~(bool_mat.todense())       # densify, then flip every entry
invert_arr = invert_arr.astype("uint8")  # back to 0/1 values as uint8
invert_mat = np.asmatrix(invert_arr, dtype="uint8")
s_invert_mat = scipy.sparse.csr_matrix(invert_mat, dtype="uint8")  # raises MemoryError
However, when I convert invert_mat to a CSR matrix with dtype="uint8", a memory error is raised. (I have tried dtype=int8, bool, and uint8, but the program still throws a memory error.)
Error
Traceback (most recent call last):
File "/home/**/Project/**/Src/calculate_distance/tversky_distance.py", line 131, in <module>
tversjt_similarities(data)
File "/home/**/Project/**/Src/calculate_distance/tversky_distance.py", line 88, in tversjt_similarities
s_invert_mat = scipy.sparse.csr_matrix(invert_mat, dtype="uint8")
File "/home/**/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py", line 86, in __init__
self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
File "/home/**/.local/lib/python3.8/site-packages/scipy/sparse/coo.py", line 189, in __init__
self.row, self.col = M.nonzero()
numpy.core._exceptions.MemoryError: Unable to allocate 387. GiB for an array with shape (25949472067, 2) and data type int64
Problem
The problem is: I specified dtype='uint8', but the dtype in the error message is int64, and int64 requires more memory.
I searched related issues and found the cause: numpy will automatically convert int8 to int64.
int8 scipy sparse matrix creation errors creating int64 structure?
The core of this package was developed for linear algebra work (e.g. finite element ODE solutions). The csr format in particular is optimized for matrix multiplication. It uses compiled code (Cython), which uses the standard C data types - integers, floats and doubles. In numpy terms that means int64, float32 and float64. Selected formats accept other dtypes like int8, but maintaining that dtype during conversions to other formats and calculations is difficult.
I know that invert_mat is very dense, because it contains so many 1s.
But is there a way to bypass or solve this problem?
I would be very grateful if anyone could give some advice.
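One way to sidestep the inversion entirely, sketched below under the stated assumption that the matrix holds only 0s and 1s (the function name tversky_similarities and the alpha/beta defaults are illustrative, not from the original post): for binary rows, |A \ B| = |A| - |A ∩ B|, so the relative complements follow from one sparse product plus the row sums, and the huge dense complement matrix never has to exist.

import numpy as np

def tversky_similarities(mat, alpha=0.5, beta=0.5):
    # mat: scipy.sparse CSR matrix of 0/1 indicators, e.g. 5910 x 4403279.
    # For binary rows A and B:  |A \ B| = |A| - |A & B|.
    mat = mat.astype(np.int64)                    # avoid overflow in the product
    inter = (mat @ mat.T).toarray()               # |A & B|, only 5910 x 5910
    sizes = np.asarray(mat.sum(axis=1)).ravel()   # |A| for every row
    a_not_b = sizes[:, None] - inter              # |A \ B|
    b_not_a = sizes[None, :] - inter              # |B \ A|
    # denominator is zero only if both rows are all-zero
    return inter / (inter + alpha * a_not_b + beta * b_not_a)

With only 5910 rows, the intersection matrix is about 280 MB as int64, nowhere near the 387 GiB the traceback asked for.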

How do I optimise this numpy array operation?

This numpy operation gives a memory error. (Here X and Y are 2D arrays with shapes (5000, 3072) and (500, 3072).)
dists[:,:] = np.sqrt(np.sum(np.square(np.subtract(X, Y[:,np.newaxis])), axis=2))
I think the numpy array broadcasting is taking up a lot of memory. Is there any way to optimise the memory usage for these array operations?
Edit:
(This is a problem from Assignment 1 of CS231n.) I found another solution that gives the same result without a memory error:
dists[:,:] = np.sqrt((Y**2).sum(axis=1)[:, np.newaxis] + (X**2).sum(axis=1) - 2 * Y.dot(X.T))
Can you help me understand why my solution is inefficient memory wise?
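A back-of-the-envelope check (the numbers below are derived from the shapes in the question, not taken from the original post) shows why: the broadcasted subtraction materializes a (500, 5000, 3072) intermediate, np.square allocates a second array of the same size, while the expanded formula's largest temporary is only (500, 5000).

n_x, n_y, d = 5000, 500, 3072

# X - Y[:, np.newaxis] broadcasts to shape (500, 5000, 3072):
elems = n_y * n_x * d                     # 7.68 billion float64 values
print(elems * 8 / 2**30)                  # ~57.2 GiB per intermediate

# The expanded form ||y||^2 + ||x||^2 - 2*y.x never leaves 2-D;
# its biggest temporary is the (500, 5000) product:
print(n_y * n_x * 8 / 2**20)              # ~19.1 MiB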

Bus error: 10 when multiplying two matrices

When performing the multiplication of two matrices (matmul) in NumPy, the program quits with the error Bus error: 10; no other error messages are presented. The error occurs at the multiplication itself:
out = w_col @ x_col
w_col is the matrix of shape (64, 27) and with type float32.
x_col is the matrix of shape (27, 12996) and with type float32.
The error happens with NumPy 1.15.* and NumPy 1.16.* on macOS 10.14.3.
Is it possible that the issue is caused by large input matrices?
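For scale, a quick size check (an illustration, not from the original report) shows the operands and result are tiny, which suggests the bus error is not a size problem:

import numpy as np

w_col = np.zeros((64, 27), dtype=np.float32)
x_col = np.zeros((27, 12996), dtype=np.float32)
out = w_col @ x_col                # (64, 12996) float32

# ~3.2 MiB for the result, a few KiB for the inputs --
# far too small to exhaust memory on any modern machine:
print(out.nbytes / 2**20)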

Matrix to the power of a column of a dense matrix using numpy in Python

I'm trying to raise all the values in a VxK matrix beta to the power of the values in a Vx1 column that is part of a dense VxN matrix. Each value in beta should be raised to the power of the corresponding row of the column, and this should be done for all K columns of beta. When I use np.power on a practice numpy array for beta:
np.power(head_beta.T, head_matrix[:,0])
I am able to obtain the results I want. The shapes are (3, 10) for head_beta.T and (10,) for head_matrix[:,0], where in this case 3=K and 10=V.
However, if I do this on my actual matrix, which was obtained by using
matrix=csc_matrix((data,(row,col)), shape=(30784,72407) ).todense()
where data, row, and col are arrays, I am unable to do the same operation:
np.power(beta.T, matrix[:,0])
where the shapes are (10, 30784) for beta.T and (30784, 1) for matrix[:,0], where in this case 10=K and 30784=V. I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-29-9f55d4cb9c63> in <module>()
----> 1 np.power(beta.T, matrix[:,0])
ValueError: operands could not be broadcast together with shapes (10,30784) (30784,1)
It seems that the difference is that matrix[:,0] is a matrix of shape (length, 1), while head_matrix[:,0] is actually a numpy array of shape (length,). How can I do this same operation with the column of a dense matrix?
In the problem case it can't broadcast (10,30784) and (30784,1). As you note, it works when (10,N) is used with (N,). That's because it can expand the (N,) to (1,N) and on to (10,N).
M = sparse.csr_matrix(...).todense()
is an np.matrix, which is always 2d, so M[:,0] is (N,1). There are several solutions.
np.power(beta.T, M[:,0].T) # change to a (1,N)
np.power(beta, M[:,0]) # line up the expandable dimensions
convert the sparse matrix to an array:
A = sparse.....toarray()
np.power(beta.T, A[:,0])
M[:,0].squeeze() and M[:,0].ravel() both produce a (1,N) matrix, as does M[:,0].reshape(-1). That 2d quality is persistent as long as the result is a matrix.
M[:,0].A1 produces an (N,) array.
From a while back: Numpy matrix to array
You can use the squeeze method on arrays to get rid of this extra dimension.
So
np.power(beta.T, matrix[:,0].squeeze()) should do the trick.
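A small self-contained demo of the fix (shapes shrunk from the question's; sparse.random and the density value are just for generating test data):

import numpy as np
from scipy import sparse

beta = np.random.rand(5, 3)                                    # (V, K) with V=5, K=3
M = sparse.random(5, 4, density=0.5, format="csc").todense()   # np.matrix, always 2d

# np.power(beta.T, M[:, 0])           # ValueError: (3,5) vs (5,1)
out = np.power(beta.T, M[:, 0].A1)    # .A1 flattens to a (5,) array
print(out.shape)                      # (3, 5)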

TypeError from theano While using 3D numpy array

I am trying something similar to the code below:
datax = theano.shared(value=rng.rand(5, 500, 45))
x = T.dmatrix('x')
i = T.lscalar('i')
W = theano.shared(value=rng.rand(90, 45, 500))
Hb = theano.shared(value=np.zeros(90))
w_v_bias = T.dot(W, x).sum(axis=2).sum(axis=1) + Hb
z = theano.function([i], w_v_bias, givens={x: datax[i*5:(i+1)*5]})
z(0)
Theano is giving me a TypeError with the message:
Cannot convert Type TensorType(float64, 3D) (of Variable Subtensor{int64:int64:}.0) into Type TensorType(float64, matrix). You can try to manually convert Subtensor{int64:int64:}.0 into a TensorType(float64, matrix)
What am I doing wrong here?
Edit
As mentioned by Daniel, changing x to dtensor3 results in another error.
ValueError: Input dimension mis-match. (input[0].shape[1] = 5, input[1].shape[1] = 90)
Apply node that caused the error: Elemwise{add,no_inplace}(Sum{axis=[1], acc_dtype=float64}.0, DimShuffle{x,0}.0)
Another way is to modify my train function, but then I won't be able to do batch learning.
z = theano.function([x], w_v_bias)
z(datax[0])
I am trying to implement RBM with integer values for visible units.
The problem is that datax is a 3D tensor and datax[index*5:(index+1)*5] is also a 3D tensor, but you're trying to assign it to x, which is a 2D tensor (i.e. a matrix).
Changing
x = T.dmatrix('x')
to
x = T.dtensor3('x')
solves this problem but creates a new one because the dimensions of W and x don't match up to perform the dot product. It's unclear what the desired outcome is.
Solved it after a few rounds of trial and error.
What I needed was to change
x = T.dmatrix('x')
w_v_bias = T.dot(W, x).sum(axis=2).sum(axis=1) + Hb
to
x = T.dtensor3('x')
w_v_bias = T.dot(x, W).sum(axis=3).sum(axis=1) + Hb
Now it produces a (5, 90) array after adding Hb elementwise to each of the five dot-product vectors.
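A NumPy-only sanity check of the final shape arithmetic (illustrative; the real code uses Theano shared variables and symbolic tensors, and the intermediate below is a ~0.9 GB temporary at these shapes):

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(5, 500, 45)        # one minibatch: datax[i*5:(i+1)*5]
W = rng.rand(90, 45, 500)
Hb = np.zeros(90)

# dot of 3-D arrays contracts x's last axis (45) with W's
# second-to-last axis (45), giving shape (5, 500, 90, 500):
out = np.dot(x, W)

w_v_bias = out.sum(axis=3).sum(axis=1) + Hb
print(w_v_bias.shape)           # (5, 90)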
