Memory issues when creating an adjacency matrix using coo_matrix - Python

Hi, I am trying to generate an adjacency matrix with a dimension of about 24,000 from a CSV with two columns listing pairs of genes and a third column of 1's indicating a present interaction. My goal is to have a square matrix populated with zeros for combinations not present in the two columns.
I am using the following Python script:
import numpy as np
from scipy.sparse import coo_matrix
# unpack the three CSV columns
l, c, v = np.loadtxt("biogrid2.csv", dtype=int, skiprows=0, delimiter=",").T[:3, :]
m = coo_matrix((l, (v - 1, c - 1)), shape=(v.max(), c.max()))
m.toarray()
and it runs fine until it hits the following error:
File "/home/charlie/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py", line 1184, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Any ideas about how to get around the memory limit in SciPy?
Thanks

Most likely what you want isn't m.toarray() but m.tocsr(). A CSR matrix can do simple linear algebra (like .dot() and matrix powers) natively; for instance, this works:
m = m.tocsr()
random_walk_2 = m.dot(m)
random_walk_n = m ** n  # for some integer n >= 1
# see https://stackoverflow.com/questions/28702416/matrix-power-for-sparse-matrix-in-python
Covariance should be implementable as well, but I'm not sure what the specific implementation would be without seeing what your current process is.
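If by covariance you mean the column-by-column covariance of m, one possible sketch (assuming that is what you're after; note that the centering term makes the result dense, so a full 24,000 x 24,000 float array still has to fit in memory):
import numpy as np

# covariance via (X^T X)/n - outer(mean, mean); avoids densifying m itself,
# but the final covariance matrix is inherently dense
n_rows = m.shape[0]
col_means = np.asarray(m.mean(axis=0)).ravel()
cov = (m.T.dot(m)).toarray() / n_rows - np.outer(col_means, col_means)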
EDIT: To turn the output back into a simpler format to write out to CSV, you can follow up by converting back to COO with .tocoo():
m = m.tocoo()
out = np.c_[m.data, m.row, m.col]  # one (value, row, col) triple per line
np.savetxt("foo.csv", out, delimiter=",")
# see https://stackoverflow.com/questions/6081008/dump-a-numpy-array-into-a-csv-file

The function toarray() will convert your 24,000 x 24,000 sparse matrix (coo_matrix) into a dense 24,000 x 24,000 array (assuming you are loading int), which needs at least
24000 * 24000 * 4 bytes ≈ 2.15 GiB of memory (and twice that if the integers are 64-bit, NumPy's default on most Linux systems).
To avoid using so much memory you should avoid converting to a dense matrix (i.e. avoid toarray()) and do your operations on the sparse matrix directly.
If you need the matrix product of the matrix with itself you can just do m * m (or m.dot(m)), and m.multiply(m) gives the element-wise square; either way you get a sparse matrix back.
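For example, a small sketch, assuming m is the coo_matrix built in the question:
m = m.tocsr()                   # CSR is efficient for arithmetic
m_squared = m.dot(m)            # sparse matrix product (m * m does the same for scipy sparse)
m_elementwise = m.multiply(m)   # element-wise square, also returned as a sparse matrix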
To save your matrix you have several options.
The simplest one is NPZ; see https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.save_npz.html or Save / load scipy sparse csr_matrix in portable data format.
If you want to get your result in the same format as your initial CSV file, coo_matrix has the attributes
data (the COO format data array of the matrix),
row (the COO format row index array of the matrix), and
col (the COO format column index array of the matrix);
see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html.
These can be used to create the CSV file.
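A short sketch of both options, assuming m is the coo_matrix from the question (file names are placeholders):
import numpy as np
import scipy.sparse

# Option 1: keep it sparse on disk in NPZ format.
scipy.sparse.save_npz("adjacency.npz", m.tocsr())
m_loaded = scipy.sparse.load_npz("adjacency.npz")

# Option 2: write the nonzero triples back out as CSV, mirroring the input file
# (indices shifted back to 1-based; the column order is illustrative).
coo = m.tocoo()
np.savetxt("adjacency_edges.csv",
           np.c_[coo.row + 1, coo.col + 1, coo.data],
           fmt="%d", delimiter=",")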

Related

Pandas sparse dataframe multiplication

I have two pandas sparse dataframes, big_sdf and bigger_sdf.
When I try to multiply them:
result = big_sdf @ bigger_sdf
I get an error:
"numpy.core._exceptions.MemoryError: Unable to allocate 3.6 TiB for an array with shape (160815, 3078149) and data type int64"
So I tried to convert these sparse dataframes to SciPy CSR matrices and multiply them, but the conversion doesn't succeed:
from scipy.sparse import csr_matrix
csr_big = csr_matrix(big_sdf)
csr_bigger = csr_matrix(bigger_sdf)
When I run the last row I get an error message:
"ValueError: unrecognized csr_matrix constructor usage"
It only happens for the bigger matrix; the smaller one converts successfully.
Any ideas? Maybe there's a pandas-native method to multiply sparse dataframes that I missed?
Thanks in advance!
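One possible approach (a hedged sketch, assuming both dataframes use pandas sparse dtypes) is to go through pandas' .sparse accessor, which converts to a SciPy COO matrix without building a dense intermediate:
# sketch: DataFrame.sparse.to_coo() avoids the dense array that csr_matrix(df) tries to build
csr_big = big_sdf.sparse.to_coo().tocsr()
csr_bigger = bigger_sdf.sparse.to_coo().tocsr()
result = csr_big @ csr_bigger   # stays sparse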

construct a large dask-backed xarray from an iterator of row vectors

How can I build an xarray from an iterator of row vectors?
The resulting array may be larger than memory and will be backed by a dask array.
The row vectors also come with unique labels which need to become the row index of the resulting xarray.
In the docs I only see a constructor that takes an in memory numpy array to begin with.
An example use case would be to store a word embedding model as an xarray with words as row labels. These models usually provide an iterator that produces (string, vector) pairs over all words in the vocabulary. Most models have on the order of hundreds of dimensions and there are usually ~10^6 words in the vocabulary. I would like to stack the vectors into a matrix in order to perform linear algebra operations and also be able to look up rows by the word string.
I would expect to be able to write something like:
import numpy as np
import xarray as xr
vectors = (('V'+str(i), np.random.randn(10000)) for i in range(10**9))
xray = xarray_from_iter(vectors)
xray.to_parquet('big_xarray.parquet')
row1234567 = xray['V1234567']
Does xarray provide something like xarray_from_iter?
If not how do I write it?
xarray_from_iter should work something like numpy.fromiter
except that it should also label the rows as it goes.
It would also need to delay the computation until dump is called,
since the whole issue is that the array is larger than memory.
TL;DR: xarray does not have a from-iterator constructor; you'll have to build your dask arrays yourself.
Also, xarray does not have a to_parquet method so that is not an operation you can do (at the moment).
Here is an example of how you might construct a dask array (and xarray.DataArray) for your use case:
import dask.array
import xarray as xr
import numpy as np
num = 10
names = []
arrays = []
for i in range(num):
    names.append('V' + str(i))
    arrays.append(dask.array.random.random(10000, chunks=(1000,)))

# stack the lazily-generated row vectors into one (num, 10000) dask array
data = dask.array.stack(arrays)
da = xr.DataArray(data, dims=('model', 'sample'), coords={'model': names})
print(da)
Yielding:
<xarray.DataArray 'stack-ff07239b7ea24834ba59f2d05b7f41e2' (model: 10,
sample: 10000)>
dask.array<shape=(10, 10000), dtype=float64, chunksize=(1, 1000)>
Coordinates:
* model (model) <U2 'V0' 'V1' 'V2' 'V3' 'V4' 'V5' 'V6' 'V7' 'V8' 'V9'
Dimensions without coordinates: sample
This is not likely to be efficient, especially when the length of the iterator gets large (like in your example). It may be worth proposing such a constructor on the dask github issues page.
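As a follow-up, a short hedged sketch of the lookup and on-disk parts of the question using the DataArray above (to_netcdf stands in for the nonexistent to_parquet; the file name is illustrative):
row_v3 = da.sel(model='V3')     # look up a row vector by its label (still lazy, dask-backed)
da.to_netcdf('big_xarray.nc')   # persist to disk; xarray writes netCDF, not parquet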

"Killed: 9" error when trying to construct a Scipy csr_matrix from a large NumPy array

I'm trying to solve a Markov chain problem in which the transition matrix contains about ~150,000 rows and columns, which is however sparse (only about ~450,000 elements are nonzero).
I notice that trying to construct a csr_matrix from an np.zeros array of that size leads to a Killed: 9 error:
In [139]: N = 150000
In [140]: T = np.zeros((N, N))
In [142]: import scipy.sparse
In [143]: _T = scipy.sparse.csr_matrix(T)
Killed: 9
Is it possible to construct a csr_matrix of this size? Do I need to initialize the matrix T as a csr_matrix and dispense with NumPy arrays altogether?
Your process gets "Killed: 9" most likely because it is taking too much of the system's memory (or too long) and has been terminated by the OS. As noted in the comments, you can construct an empty sparse matrix of that size directly using csr_matrix:
_T = scipy.sparse.csr_matrix((N,N))
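If you already know the nonzero entries of the transition matrix, a hedged sketch of building it directly from (row, column, value) triplets, never allocating the dense array (rows, cols and vals here are placeholders for your ~450,000 entries):
import numpy as np
import scipy.sparse

N = 150000
rows = np.array([0, 1, 2])            # placeholder row indices
cols = np.array([1, 2, 0])            # placeholder column indices
vals = np.array([0.5, 1.0, 0.25])     # placeholder transition probabilities

T = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(N, N))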

Put multiple 2d numpy arrays into 3d numpy array

I'm trying to put multiple 2-D numpy arrays into one 3-D numpy array and then save the 3-D numpy array as a compressed file to a directory for later use.
I have a list that I'm looping through which will compute forecasts for different hazards. A forecast for each hazard (a 129x185 numpy array) will be computed one at a time. I want to then put each forecast array into an empty 129x185x7 numpy array.
hazlist = ['allsvr', 'torn', 'sigtorn', 'hail', 'sighail', 'wind', 'sigwind']

# Create 3-D empty numpy array
grid = np.zeros(shape=(129,185,7))

for i,haz in enumerate(hazlist):
    # *do some computation to create forecast array for current hazard*

    # Now have 2-D 129x185 forecast array
    print(fcst)

    # Place 2-D array into empty 3-D array.
    # *Not sure how to do this...*

# Save 3-D array to .npz file in directory when all 7 hazard forecasts are done.
np.savez_compressed('pathtodir/3dnumpyarray.npz')
But I want to give each forecast array its own grid name inside the 3-D array, so that if I want a certain one (like tornadoes) I can just call it with:
filename = np.load('pathtodir/3dnumpyarray.npz')
arr = filename['torn']
It would be greatly appreciated if someone were able to assist me. Thanks.
It sounds like you actually want to use a dictionary. Each dictionary entry could be a 2D array with the reference name as the key:
hazlist = ['allsvr', 'torn', 'sigtorn', 'hail', 'sighail', 'wind', 'sigwind']

# Create empty dictionary
grid = {}

for i,haz in enumerate(hazlist):
    # *do some computation to create forecast array for current hazard*

    # Now have 2-D 129x185 forecast array
    print(fcst)

    # Place 2-D array into dictionary.
    grid[haz] = fcst  # Assuming fcst is the 2D array?

# Save each hazard array under its own name in the .npz file
np.savez_compressed("output", **grid)
It might be best to save this as a JSON file. If the data needs to be compressed you can refer to this question and answer as to saving json in gzipped format, or this one may be clearer.
It's not clear from your example, but my assumption in the above code is that fcst is the 2D array that corresponds to the label haz in each iteration of the loop.
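With the arrays stored under their hazard names via the **grid expansion above, a single forecast can then be loaded back by key, as the question wanted; a short sketch:
import numpy as np

data = np.load("output.npz")
torn = data['torn']        # the 129x185 tornado forecast array
print(data.files)          # lists every saved hazard name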

Huge sparse matrix in python

I need to iteratively construct a huge sparse matrix in numpy/scipy. The initialization is done within a loop:
from scipy.sparse import dok_matrix, csr_matrix

def foo(*args):
    dim_x = 256*256*1024
    dim_y = 128*128*512
    matrix = dok_matrix((dim_x, dim_y))
    for i in range(dim_x):
        # compute stuff in order to get j
        matrix[i, j] = 1.
    return matrix.tocsr()
Then I need to convert it to a csr_matrix because of further computations like:
matrix = foo(...)
result = matrix.T.dot(x)
At the beginning this was working fine, but my matrices are getting bigger and bigger and my computer started to crash. Is there a more elegant way of storing the matrix?
Basically I have the following requirements:
The matrix needs to store float values from 0. to 1.
I need to compute the transpose of the matrix
I need to compute the dot product with a x_dimensional vector
The matrix dimensions can be around 1*10^9 x 1*10^8
My RAM is being exceeded. I was reading several posts on Stack Overflow and the rest of the internet ;) I found PyTables, which isn't really made for matrix computations... etc. Is there a better way?
For your case I would recommend using the data type np.int8 (or np.uint8), which requires only one byte per element:
matrix = dok_matrix((dim_x, dim_y), dtype=np.int8)
Directly constructing the csr_matrix will also allow you to go further with the maximum matrix size:
import numpy as np
from scipy.sparse import csr_matrix

def foo(*args):
    dim_x = 256*256*1024
    dim_y = 128*128*512
    row = []
    col = []
    for i in range(dim_x):
        # compute stuff in order to get j
        row.append(i)
        col.append(j)
    data = np.ones_like(row, dtype=np.int8)
    return csr_matrix((data, (row, col)), shape=(dim_x, dim_y), dtype=np.int8)
You may have hit the limits of what Python can do for you, or you may be able to do a little more. Try setting a dtype of np.float32; if you're on a 64-bit machine, this reduced precision may reduce your memory consumption. np.float16 may help with memory even further, but your calculations may slow down (I've seen examples where processing takes 10x as long):
matrix = dok_matrix((dim_x, dim_y), dtype=np.float32)
or possibly much slower, but even less memory consumption:
matrix = dok_matrix((dim_x, dim_y), dtype=np.float16)
Another option: buy more system memory.
Finally, if you can avoid creating your matrix with dok_matrix, and can create it instead with csr_matrix (I don't know if this is possible for your calculations) you may save a little overhead on the dict that dok_matrix uses.
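As a follow-up, a hedged sketch of the transpose-dot requirement using the CSR matrix returned by foo() above (x is a placeholder vector; the result is a dense 1-D array of length dim_y, which is comfortably small):
import numpy as np

matrix = foo()                          # CSR matrix of shape (dim_x, dim_y)
x = np.random.rand(matrix.shape[0])     # dense vector of length dim_x
result = matrix.T.dot(x)                # sparse-aware product; returns a dense 1-D array of length dim_y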
