What's the best way to serialize a large scipy sparse matrix? - python

I have a large scipy sparse matrix, which is taking up >90% of my total system memory. I would like to save it to disk, as it takes hours to build the matrix...
I tried cPickle, but that leads to a major memory explosion...
import numpy as np
from scipy.sparse import lil_matrix
import cPickle
dim = 10**8
M = lil_matrix((dim, dim), dtype=np.float)
with open(filename, 'wb') as f:
    cPickle.dump(M, f)  # leads to a major memory explosion, presumably there is lots of copying
while HDF5 didn't like the datatype: TypeError: Object dtype dtype('O') has no native HDF5 equivalent
So what should I do?

Pickling is very memory inefficient, unfortunately. I would recommend accessing the underlying data array attributes of the sparse matrix and storing those in an efficient format such as HDF5. Reconstructing a sparse matrix from a triplet of row/column/data vectors is straightforward.
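For example, here is a minimal sketch of that approach using h5py (the file name and dataset names are placeholders, not something from the question):
import h5py
from scipy.sparse import coo_matrix

def save_sparse_hdf5(M, filename):
    # store the COO triplets (row, col, data) plus the shape
    coo = M.tocoo()
    with h5py.File(filename, 'w') as f:
        f.create_dataset('row', data=coo.row)
        f.create_dataset('col', data=coo.col)
        f.create_dataset('data', data=coo.data)
        f.attrs['shape'] = coo.shape

def load_sparse_hdf5(filename):
    # rebuild the sparse matrix from the stored triplets
    with h5py.File(filename, 'r') as f:
        row, col, data = f['row'][:], f['col'][:], f['data'][:]
        shape = tuple(f.attrs['shape'])
    return coo_matrix((data, (row, col)), shape=shape)
Note that .tocoo() itself makes a copy of the stored entries, so for a matrix already close to the memory limit you may want to build and save the triplets incrementally instead.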

It depends on how much data is actually stored in the matrix. Have you looked at converting the matrix type before serialisation?
The LIL matrix is not the most memory efficient sparse matrix you have available. You could look at converting to either DIA, COO or DOK before pickling.
For example:
In [43]: dim = 10**6
In [44]: M = lil_matrix((dim, dim), dtype=np.float)
In [45]: for ii in range(10000):
    ...:     M[np.random.randint(0, dim), np.random.randint(0, dim)] = 1
In [46]: len(cPickle.dumps(M.todok()))
Out[46]: 1256302
In [47]: len(cPickle.dumps(M.tocoo()))
Out[47]: 557691
# compared to
In [48]: len(cPickle.dumps(M))
Out[48]: 23018393
These formats don't all support the same set of operations, but conversion between the formats is trivial.
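For instance, assuming M is the LIL matrix from above, the round-trips are one-liners:
M_coo = M.tocoo()      # compact triplet storage, good for serialising
M_dok = M_coo.todok()  # dict-of-keys, good for incremental element assignment
M_csr = M_dok.tocsr()  # good for arithmetic and dot products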

Related

Fast and efficient way of serializing and retrieving a large number of numpy arrays from HDF5 file

I have a huge list of numpy arrays, specifically 113287 of them, where each array is of shape 36 x 2048. In terms of memory, this amounts to 32 gigabytes.
As of now, I have serialized these arrays as one giant HDF5 file. The problem is that retrieving individual arrays from this hdf5 file takes an excruciatingly long time (north of 10 minutes) per access.
How can I speed this up? This is very important for my implementation, since I have to index into this list several thousand times to feed data into deep neural networks.
Here's how I index into hdf5 file:
In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')
In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'
In [6]: group_key = list(hf.keys())[0]
In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">
# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)
Any ideas where I can speed things up? Is there any other way of serializing these arrays for faster access?
Note: I'm using a Python list since I want the order to be preserved (i.e. to retrieve the arrays in the same order as I put them in when I created the hdf5 file)
According to Out[7], "img_feats" is a single large 3d dataset with shape (113287, 36, 2048).
Define ds as the dataset (doesn't load anything):
ds = hf[group_key]
x = ds[0] # should be a (36, 2048) array
arr = ds[:] # should load the whole dataset into memory.
arr = ds[:n] # load a subset, slice
According to h5py-reading-writing-data:
HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.
I don't see any point in wrapping that in list(); that is, in splitting the 3d array in a list of 113287 2d arrays. There's a clean mapping between 3d datasets on the HDF5 file and numpy arrays.
h5py-fancy-indexing warns that fancy indexing of a dataset is slower, that is, loading, say, the [1, 1000, 3000, 6000] subarrays of that large dataset.
You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.
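Concretely, to fetch a single sample from the file in the question, slice the dataset directly instead of wrapping it in list() (continuing the session from the question, so hf and group_key are already defined):
ds = hf[group_key]
last = ds[-1]        # same result as list(hf[group_key])[-1], but reads only one (36, 2048) slice
sample = ds[1000]    # any index works, and the original order is preserved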
One way would be to put each sample into its own group and index directly into those. I am thinking the conversion takes long because it tries to load the entire data set into a list (which it has to read from disk). Re-organizing the h5 file such that
group
    sample
        36 x 2048
may help in indexing speed.
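A rough sketch of that layout, with hypothetical file/group names and random stand-in data rather than the real features:
import h5py
import numpy as np

feats = np.random.rand(10, 36, 2048).astype('float32')   # stand-in for the real feature arrays

with h5py.File('train_by_sample.hdf5', 'w') as f:
    grp = f.create_group('samples')
    for i, arr in enumerate(feats):
        grp.create_dataset(str(i), data=arr)              # one dataset per sample

with h5py.File('train_by_sample.hdf5', 'r') as f:
    x = f['samples']['7'][:]                              # reads only one (36, 2048) block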

Numpy Matrix Memory size low compared to Numpy Array

I have a .npz file which I want to load into RAM. The compressed file size is 30 MB. I am doing the following operation to load the data into RAM.
import numpy as np
from scipy import sparse
from sys import getsizeof
a = sparse.load_npz('compressed/CRS.npz').todense()
getsizeof(a)
# 136
type(a)
# numpy.matrixlib.defmatrix.matrix
b = np.array(a)
getsizeof(b)
# 64000112
type(b)
# numpy.ndarray
Why does the numpy.matrix object occupy so little memory compared to the numpy.ndarray? Both a and b have the same dimensions and data.
Your a matrix is a view of another array, so the underlying data is not counted towards its getsizeof. You can see this by checking that a.base is not None, or by seeing that the OWNDATA flag is False in a.flags.
Your b array is not a view, so the underlying data is counted towards its getsizeof.
numpy.matrix doesn't provide any memory savings.
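You can check this yourself (continuing with the a and b from the question); nbytes, unlike getsizeof, reports the full size of the underlying data either way:
a.flags['OWNDATA']   # False: a is a view, its buffer belongs to a.base
b.flags['OWNDATA']   # True: b owns its buffer
a.base is not None   # True
a.nbytes, b.nbytes   # identical, both count the full data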

Memory issues with creating an adjacency matrix using Coo-matrix

Hi, I am trying to generate an adjacency matrix with a dimension of about 24,000 from a CSV with two columns listing pairs of genes and a third column of 1's indicating a present interaction... My goal is to have it be square and populated with zeros for combinations not in the two columns.
I am using the following Python script
import numpy as np
from scipy.sparse import coo_matrix
l, c, v = np.loadtxt("biogrid2.csv", dtype=int, skiprows=0, delimiter=",").T[:3, :]
m = coo_matrix((l, (v-1, c-1)), shape=(v.max(), c.max()))
m.toarray()
and it runs OK until it hits the following error:
File "/home/charlie/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py", line 1184, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Any ideas about how to get around the memory limit in SciPy?
Thanks
Most likely what you want isn't m.toarray() but m.tocsr(). A CSR matrix can do simple linear algebra (like .dot() and matrix powers) natively; for instance, this works:
m = m.tocsr()
random_walk_2 = m.dot(m)
random_walk_n = m ** n
# see https://stackoverflow.com/questions/28702416/matrix-power-for-sparse-matrix-in-python
Covariance should be implementable as well, but I'm not sure what the specific implementation would be without seeing what your current process is.
EDIT: To turn the output back into a simpler format to write out to CSV, you can follow up by converting back to COO with .tocoo():
m = m.tocoo()
out = np.c_[m.data, m.row, m.col].T
np.savetxt("foo.csv", out, delimiter=",")
# see https://stackoverflow.com/questions/6081008/dump-a-numpy-array-into-a-csv-file
The function toarray() will convert your 24000 x 24000 sparse matrix (coo_matrix) into a dense 24000 x 24000 array, which (assuming 4-byte integers) needs at least
24000 * 24000 * 4 bytes ≈ 2.15 GiB of memory.
To avoid using so much memory, don't convert to a dense array (with toarray()); do your operations on the sparse matrix instead.
If you need your matrix squared you can just do m*m (the matrix product) or m.multiply(m) (the element-wise product), and either way you get a sparse matrix back.
To save your matrix you have several options.
The simplest is NPZ; see https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.save_npz.html or Save / load scipy sparse csr_matrix in portable data format.
If you want to get your result back in the form of your initial CSV file, coo_matrix has the attributes
data COO format data array of the matrix
row COO format row index array of the matrix
col COO format column index array of the matrix
see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html
which can be used to create the CSV file.
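A short sketch of both options, using the m from the question (output file names are placeholders):
import numpy as np
from scipy import sparse

m = m.tocsr()
sparse.save_npz('adjacency.npz', m)                 # compact binary format
m_back = sparse.load_npz('adjacency.npz')

coo = m.tocoo()
np.savetxt('adjacency_triplets.csv',
           np.c_[coo.row, coo.col, coo.data],
           delimiter=',', fmt='%g')                 # one (row, col, value) triplet per line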

"Killed: 9" error when trying to construct a Scipy csr_matrix from a large NumPy array

I'm trying to solve a Markov chain problem in which the transition matrix contains about ~150,000 rows and columns, which is however sparse (only about ~450,000 elements are nonzero).
I notice that trying to construct a csr_matrix from an np.zeros array of that size leads to a Killed: 9 error:
In [139]: N = 150000
In [140]: T = np.zeros((N, N))
In [142]: import scipy.sparse
In [143]: _T = scipy.sparse.csr_matrix(T)
Killed: 9
Is it possible to construct a csr_matrix of this size? Do I need to initialize the matrix T as a csr_matrix and dispense with NumPy arrays altogether?
Your process is "Killed: 9" most likely because it is using too much memory and has been terminated by the OS: a dense 150,000 x 150,000 float64 array alone needs 150,000 * 150,000 * 8 bytes ≈ 180 GB. Just like in the comment, you can construct an (empty) sparse matrix of that shape directly with csr_matrix:
_T = scipy.sparse.csr_matrix((N,N))
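If the ~450,000 nonzero transitions are available as coordinate triplets, you can also build the matrix from them directly, never allocating the dense array. A sketch with random placeholder data standing in for the real transitions:
import numpy as np
import scipy.sparse

N = 150000
nnz = 450000
rows = np.random.randint(0, N, size=nnz)   # placeholder row indices
cols = np.random.randint(0, N, size=nnz)   # placeholder column indices
vals = np.random.rand(nnz)                 # placeholder transition probabilities

T = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(N, N))
print(T.shape, T.nnz)   # (150000, 150000), ~450000 stored entries (duplicates get summed)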

Huge sparse matrix in python

I need to iteratively construct a huge sparse matrix in numpy/scipy. The initialization is done within a loop:
from scipy.sparse import dok_matrix, csr_matrix

def foo(*args):
    dim_x = 256*256*1024
    dim_y = 128*128*512
    matrix = dok_matrix((dim_x, dim_y))
    for i in range(dim_x):
        # compute stuff in order to get j
        matrix[i, j] = 1.
    return matrix.tocsr()
Then I need to convert it to a csr_matrix for further computations like:
matrix = foo(...)
result = matrix.T.dot(x)
At the beginning this was working fine. But my matrices are getting bigger and bigger and my computer starts to crash. Is there a more elegant way in storing the matrix?
Basically i have the following requirements:
The matrix needs to store float values from 0.0 to 1.0
I need to compute the transpose of the matrix
I need to compute the dot product with an x-dimensional vector
The matrix dimensions can be around 1*10^9 x 1*10^8
My RAM is being exhausted. I have read several posts on Stack Overflow and the rest of the internet ;) I found PyTables, which isn't really made for matrix computations... etc. Is there a better way?
For your case I would recommend using the data type np.int8 (or np.uint8), which requires only one byte per element:
matrix = dok_matrix((dim_x, dim_y), dtype=np.int8)
Directly constructing the csr_matrix will also allow you to go further with the maximum matrix size:
import numpy as np
from scipy.sparse import csr_matrix

def foo(*args):
    dim_x = 256*256*1024
    dim_y = 128*128*512
    row = []
    col = []
    for i in range(dim_x):
        # compute stuff in order to get j
        row.append(i)
        col.append(j)
    data = np.ones_like(row, dtype=np.int8)
    return csr_matrix((data, (row, col)), shape=(dim_x, dim_y), dtype=np.int8)
You may have hit the limits of what Python can do for you, or you may be able to do a little more. Try setting a dtype of np.float32; if you're on a 64-bit machine, this reduced precision may cut your memory consumption. np.float16 may save even more memory, but your calculations may slow down (I've seen examples where processing took 10x as long):
matrix = dok_matrix((dim_x, dim_y), dtype=np.float32)
or possibly much slower, but even less memory consumption:
matrix = dok_matrix((dim_x, dim_y), dtype=np.float16)
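To get a rough feel for how much the value dtype matters, you can compare the size of the CSR data array for the same sparsity pattern under different dtypes; a toy-sized sketch (not the poster's dimensions, and float16 is left out since not every scipy version accepts it for sparse matrices):
import numpy as np
from scipy.sparse import csr_matrix

rows = np.arange(1000)
cols = np.arange(1000)
m64 = csr_matrix((np.ones(1000, dtype=np.float64), (rows, cols)))
m32 = csr_matrix((np.ones(1000, dtype=np.float32), (rows, cols)))
m8  = csr_matrix((np.ones(1000, dtype=np.int8), (rows, cols)))

print(m64.data.nbytes, m32.data.nbytes, m8.data.nbytes)   # 8000, 4000, 1000
# the indices/indptr arrays are unaffected by the value dtype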
Another option: buy more system memory.
Finally, if you can avoid creating your matrix with dok_matrix, and can create it instead with csr_matrix (I don't know if this is possible for your calculations) you may save a little overhead on the dict that dok_matrix uses.
