Out-of-core processing of sparse CSR arrays - python

How can one apply some function in parallel on chunks of a sparse CSR array saved on disk using Python? Sequentially this could be done, e.g., by saving the CSR array with joblib.dump, opening it with joblib.load(.., mmap_mode="r"), and processing the chunks of rows one by one. Could this be done more efficiently with dask?
In particular, assume one doesn't need all the possible out-of-core operations on sparse arrays, just the ability to load row chunks in parallel (each chunk being a CSR array) and apply some function to them (in my case, e.g., estimator.predict(X) from scikit-learn).
Besides, is there a file format on disk that would be suitable for this task? Joblib works, but I'm not sure about the (parallel) performance of CSR arrays loaded as memory maps; spark.mllib appears to use either some custom sparse storage format (which doesn't seem to have a pure Python parser) or the LIBSVM format (whose parser in scikit-learn is, in my experience, much slower than joblib.dump)...
Note: I have read the documentation and various issues about this on https://github.com/dask/dask/, but I'm still not sure how best to approach this problem.
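For concreteness, the sequential approach described above would look roughly like this (just a sketch; clf stands for an already-fitted scikit-learn estimator, which is an assumption on my part):
import joblib
import scipy.sparse
from sklearn.utils import gen_batches

# dump the CSR array, reload it memory-mapped, process row chunks one by one;
# clf is a hypothetical, already-fitted scikit-learn estimator
joblib.dump(scipy.sparse.random(10000, 1000, format='csr'), 'X_small_csr.pkl')
X = joblib.load('X_small_csr.pkl', mmap_mode='r')
results = [clf.predict(X[sl]) for sl in gen_batches(X.shape[0], 1000)]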
Edit: to give a more practical example, below is the code that works in dask for dense arrays but fails when using sparse arrays with this error,
import numpy as np
import scipy.sparse
import joblib
import dask.array as da
from sklearn.utils import gen_batches
np.random.seed(42)
joblib.dump(np.random.rand(100000, 1000), 'X_dense.pkl')
joblib.dump(scipy.sparse.random(10000, 1000000, format='csr'), 'X_csr.pkl')
fh = joblib.load('X_dense.pkl', mmap_mode='r')
# computing the results without dask
batch_size = 2000
results = np.vstack([fh[sl, :].sum(axis=1)
                     for sl in gen_batches(fh.shape[0], batch_size)])
# computing the results with dask
x = da.from_array(fh, chunks=(2000))
results = x.sum(axis=1).compute()
Edit 2: following the discussion below, the example below overcomes the previous error but now runs into IndexError: tuple index out of range in dask/array/core.py:L3413,
import dask
# +imports from the example above
dask.set_options(get=dask.get)  # disable multiprocessing

fh = joblib.load('X_csr.pkl', mmap_mode='r')

def func(x):
    if x.ndim == 0:
        # dask does some heuristics with dummy data; if x is a 0d array
        # the sum command would fail
        return x
    res = np.asarray(x.sum(axis=1, keepdims=True))
    return res

Xd = da.from_array(fh, chunks=(2000))
results_new = Xd.map_blocks(func).compute()

So I don't know anything about joblib or dask, let alone your application-specific data format. But it is actually possible to read sparse matrices from disk in chunks while retaining the sparse data structure.
While the Wikipedia article for the CSR format does a great job explaining how it works, I'll give a short recap:
Some sparse Matrix, e.g.:
1 0 2
0 0 3
4 5 6
is stored by remembering each nonzero-value and the column it resides in:
sparse.data    = 1 2 3 4 5 6   # actual values
sparse.indices = 0 2 2 0 1 2   # column index (0-indexed)
Now we are still missing the rows. The compressed format just stores how many non-zero values there are in each row, instead of storing every single value's row.
Note that the non-zero count is also accumulated, so the following array contains the number of non-zero values up to and including this row. To complicate things even further, the array always starts with a 0 and thus contains num_rows+1 entries:
sparse.indptr = 0 2 3 6
so up until and including the second row there are 3 nonzero values, namely 1, 2 and 3.
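You can check all of this against scipy directly; a quick sketch with the example matrix:
import numpy as np
import scipy.sparse

# the 3x3 example matrix from above
m = scipy.sparse.csr_matrix(np.array([[1, 0, 2],
                                      [0, 0, 3],
                                      [4, 5, 6]]))
print(m.data)     # [1 2 3 4 5 6]
print(m.indices)  # [0 2 2 0 1 2]
print(m.indptr)   # [0 2 3 6]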
Now that we have this sorted out, we can start 'slicing' the matrix. The goal is to construct the data, indices and indptr arrays for each chunk. Assume the original huge matrix is stored in three binary files, which we read incrementally, using a generator to repeatedly yield chunks.
For this we need to know how many non-zero values are in each chunk, and read the corresponding amount of values and column indices. The non-zero count can conveniently be read from the indptr array: read the number of indptr entries that corresponds to the desired chunk size; the last entry of that portion, minus the number of non-zeros seen before it, gives the number of non-zeros in the chunk. The chunk's data and indices arrays are then just slices of the big data and indices files. The indptr array needs to be prepended artificially with a zero (that's what the format wants, don't ask me :D).
Then we can just construct a sparse matrix with the chunk data, indices and indptr to get a new sparse matrix.
It has to be noted that the actual matrix width cannot be reconstructed from the three arrays alone: at best it is the maximum column index plus one, and if you are unlucky and there is no data in the chunk, it's undetermined. So we also need to pass the column count.
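Before going to the file-based version, here is the same slicing idea done purely in memory on the example matrix (just a sketch; the chunk boundaries are arbitrary):
import numpy as np
import scipy.sparse

m = scipy.sparse.csr_matrix(np.array([[1, 0, 2],
                                      [0, 0, 3],
                                      [4, 5, 6]]))
start, stop = 1, 3                   # rows of the desired chunk
indptr = m.indptr[start:stop + 1]    # nnz boundaries for these rows
nnz_before = indptr[0]

chunk_data = m.data[nnz_before:indptr[-1]]
chunk_indices = m.indices[nnz_before:indptr[-1]]
chunk_indptr = indptr - nnz_before   # now starts with the required 0

chunk = scipy.sparse.csr_matrix((chunk_data, chunk_indices, chunk_indptr),
                                shape=(stop - start, m.shape[1]))
print(chunk.toarray())               # rows 1 and 2 of the original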
I probably explained things in a rather complicated way, so just read the following as an opaque piece of code that implements such a generator:
import numpy as np
import scipy.sparse

def gen_batches(batch_size, sparse_data_path, sparse_indices_path,
                sparse_indptr_path, dtype=np.float32, column_size=None):
    data_item_size = dtype().itemsize
    with open(sparse_data_path, 'rb') as data_file, \
            open(sparse_indices_path, 'rb') as indices_file, \
            open(sparse_indptr_path, 'rb') as indptr_file:
        # the first indptr entry is always 0; it seeds the running nnz count
        nnz_before = np.frombuffer(indptr_file.read(4), dtype=np.int32)
        while True:
            indptr_batch = np.frombuffer(nnz_before.tobytes() +
                                         indptr_file.read(4 * batch_size),
                                         dtype=np.int32)
            if len(indptr_batch) == 1:
                break
            batch_indptr = indptr_batch - nnz_before
            nnz_before = indptr_batch[-1]
            batch_nnz = batch_indptr[-1].item()
            batch_data = np.frombuffer(data_file.read(
                data_item_size * batch_nnz), dtype=dtype)
            batch_indices = np.frombuffer(indices_file.read(
                4 * batch_nnz), dtype=np.int32)
            dimensions = (len(indptr_batch) - 1, column_size)
            matrix = scipy.sparse.csr_matrix(
                (batch_data, batch_indices, batch_indptr), shape=dimensions)
            yield matrix

if __name__ == '__main__':
    sparse = scipy.sparse.random(5, 4, density=0.1, format='csr',
                                 dtype=np.float32)
    sparse.data.tofile('sparse.data')        # dtype as specified above
    sparse.indices.tofile('sparse.indices')  # dtype=int32
    sparse.indptr.tofile('sparse.indptr')    # dtype=int32
    print(sparse.toarray())
    print('========')
    for batch in gen_batches(2, 'sparse.data', 'sparse.indices',
                             'sparse.indptr', column_size=4):
        print(batch.toarray())
numpy.ndarray.tofile() just stores the raw binary array, so you need to remember the data format when reading it back. scipy.sparse represents the indices and indptr as int32, so that's a limitation on the total matrix size.
Also, I benchmarked the code and found that the scipy csr matrix constructor is the bottleneck for small matrices. Your mileage may vary though; this is just a 'proof of principle'.
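To tie this back to the original question, the chunks yielded by the generator could then be dispatched with joblib.Parallel, for instance. This is only a sketch: clf stands for an already-fitted scikit-learn estimator (my assumption), and the file paths are the ones written in the __main__ block above.
import numpy as np
from joblib import Parallel, delayed

chunks = gen_batches(2, 'sparse.data', 'sparse.indices',
                     'sparse.indptr', column_size=4)
# clf.predict is applied to each chunk in a separate worker
predictions = Parallel(n_jobs=2)(
    delayed(clf.predict)(chunk) for chunk in chunks)
y_pred = np.concatenate(predictions)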
If there is need for a more sophisticated implementation, or something is too obtuse, just hit me up :)

Related

How do I read Row Wise instead of column wise with h5py?

I have a MATLAB file holding a matrix of shape 70x10,000,000 (70 rows, 10,000,000 columns).
What's annoying is that when I run this code, which is supposed to print that dataset,
f = h5py.File(filepath, 'r')
item = list(f.items())[0][1]
print(item)
it reports the shape as 10,000,000x70 (10,000,000 rows, 70 columns).
Is there a way to keep the original shape?
h5py returns HDF5 data as NumPy arrays. So, the key to using h5py is using NumPy methods when needed. You can easily transpose an array using np.transpose(). A simple example is provided below. It creates an HDF5 file with 2 datasets: 1) an array with shape (20,5), and 2) the transposed array with shape (5,20). Then it extracts the 2 arrays and uses np.transpose() to switch the row/column order.
import numpy as np
import h5py

with h5py.File('SO_67031436', 'w') as h5w:
    arr = np.arange(100.).reshape(20, 5)
    h5w.create_dataset('ds_1', data=arr)
    h5w.create_dataset('ds_1t', data=np.transpose(arr))

with h5py.File('SO_67031436', 'r') as h5r:
    for name in h5r:
        print(name, ', shape=', h5r[name].shape)
        arr = np.transpose(h5r[name][:])
        print('transposed shape=', arr.shape)
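If the dataset is large, you can also slice before transposing, so only the needed rows are read from disk; a small sketch reusing the file created above:
import numpy as np
import h5py

with h5py.File('SO_67031436', 'r') as h5r:
    ds = h5r['ds_1']           # stored with shape (20, 5)
    chunk = ds[:10, :]         # h5py reads only these rows from disk
    arr = np.transpose(chunk)  # work with the chunk as (5, 10)
    print(arr.shape)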

tile operation to create a csr_matrix from one row of another csr_matrix

I have a sparse matrix 'a' of type csr_matrix. I want to perform an operation to create a new csr_matrix 'b' where each row of 'b' is the same ith row of 'a'.
I think for normal numpy arrays this is possible using the 'tile' operation. But I am not able to find the same for csr_matrix.
Making a full numpy matrix first and converting it to csr_matrix is not an option, as the size of the matrix is 10000 x 10000.
I actually managed to find an answer that doesn't require creating the full numpy matrix and is quite fast for my purpose. So I'm adding it here in case it's useful for people in the future:
import numpy as np
import scipy.sparse

rows, cols = a.shape
b = scipy.sparse.csr_matrix((np.tile(a[2].data, rows),
                             np.tile(a[2].indices, rows),
                             np.arange(0, rows * a[2].nnz + 1, a[2].nnz)),
                            shape=a.shape)
This takes the row of 'a' at index 2 and tiles it to create 'b'.
Following is the timing test; it seems quite fast for a 10000x10000 matrix:
100 loops, best of 3: 2.24 ms per loop
There is a block format (scipy.sparse.bmat, and the related sparse.vstack) that lets you create a new sparse matrix from a list of other matrices.
So for a start you could:
a1 = a[i, :]
ll = [a1, a1, a1, a1]
sparse.vstack(ll)
I don't have a shell running to test this.
Internally this format turns all the input matrices into coo format and collects their coo attributes into 3 large lists (or arrays). In your case of tiled rows, the data and col (j) values would just repeat; the row (i) values would step.
Another way to approach it would be to construct a small test matrix and look at the attributes. What kinds of repetition do you see? It's easy to see patterns in the coo format. lil might also be easy to replicate, maybe with the list * n operation. csr is trickier to understand.
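A rough sketch of that coo-based idea (illustrative only; i is an arbitrary row index):
import numpy as np
import scipy.sparse as sparse

a = sparse.random(10000, 10000, density=0.001, format='csr')
i = 2
row = a.getrow(i).tocoo()    # coo exposes .row, .col and .data directly

n_rows = a.shape[0]
data = np.tile(row.data, n_rows)                 # data values just repeat
col = np.tile(row.col, n_rows)                   # col (j) values just repeat
row_idx = np.repeat(np.arange(n_rows), row.nnz)  # row (i) values step

b = sparse.coo_matrix((data, (row_idx, col)), shape=a.shape).tocsr()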
One can do
import numpy as np
import scipy as sp
import scipy.sparse   # make sp.sparse available

row = a.getrow(row_idx)
n_rows = a.shape[0]
b = tiled_row = sp.sparse.vstack(np.repeat(row, n_rows))

Storing multiple arrays within multiple arrays within an array Python/Numpy

I have a text file with 93 columns and 1699 rows that I have imported into Python. The first three columns do not contain data that is necessary for what I'm currently trying to do. Within each column, I need to divide each element (aka row) in the column by all of the other elements (rows) in that same column. The result I want is an array of 90 elements, where each of the 1699 elements holds 1699 values (i.e. 90x1699x1699).
A more detailed description of what I'm attempting: I begin with Column3. At Column3, Row1 is to be divided by all the other rows (including the value in Row1) within Column3. That will give Row1 1699 calculations. Then the same process is done for Row2 and so on until Row1699. This gives Column3 1699x1699 calculations. When the calculations of all of the rows in Column 3 have completed, then the program moves on to do the same thing in Column 4 for all of the rows. This is done for all 90 columns which means that for the end result, I should have 90x1699x1699 calculations.
My code as it currently stands is:
import numpy as np
from glob import glob

fnames = glob("NIR_data.txt")
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
NIR_band = NIR_values.T
C_values = []
for i in range(3, len(NIR_band)):
    for j in range(0, len(NIR_band[3])):
        loop_list = NIR_band[i][j] / NIR_band[i, :]
        C_values.append(loop_list)
What it produces is an array of dimension 1699x1699. Each individual array holds the results from the row calculations. Another complaint is that the code takes ages to run. So, I have two questions: is it possible to create the type of array I'd like to work with? And is there a faster way of coding this calculation?
Dividing each of the numbers in a given column by each of the other values in the same column can be accomplished in one operation as follows.
result = a[:, numpy.newaxis, :] / a[numpy.newaxis, :, :]
Because looping over the elements happens in the optimized binary depths of numpy, this is as fast as Python is ever going to get for this operation.
If a.shape was [1699,90] to begin with, then the result will have shape [1699,1699,90]. Assuming dtype=float64, that means you will need nearly 2 GB of memory available to store the result.
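A quick sanity check of the shapes, with random data standing in for the real file:
import numpy as np

a = np.random.rand(1699, 93)[:, 3:]  # drop the first 3 columns -> (1699, 90)
result = a[:, np.newaxis, :] / a[np.newaxis, :, :]
print(result.shape)                  # (1699, 1699, 90)
print(result.nbytes / 1e9)           # roughly 2 GB of float64 output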
First let's focus on the load:
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
Your text talks about loading a file and manipulating it, but this clip loads multiple files and joins them.
My first change is to collect the arrays in a list, not another array:
alist = [np.loadtxt(f, skiprows=1) for f in fnames]
If you want to skip some columns, look at using the usecols parameter. That may save you work later.
The elements of alist will now be 2d arrays (of floats). If they are matching sizes (N,M), they can be joined in various ways. If there are n files, then
arrays = np.array(alist) # (n,N,M) array
arrays = np.concatenate(alist, axis=0) # (n*N, M) array
# similarly for axis=1
Your code does the same, but potentially confuses steps:
In [566]: arrays = np.array([np.ones((3,4)) for i in range(5)])
In [567]: arrays.shape
Out[567]: (5, 3, 4) # (n,N,M) array
In [568]: NIR_values = np.concatenate(arrays)
In [569]: NIR_values.shape
Out[569]: (15, 4) # (n*N, M) array
NIR_band is now (4,15), and its len() is shape[0], the size of the 1st dimension. len(NIR_band[3]) is shape[1], the size of the 2nd dimension.
You could skip the columns of NIR_values with NIR_values[:,3:].
I get lost in the rest of text description.
The NIR_band[i][j]/NIR_band[i,:], I would rewrite as NIR_band[i,j]/NIR_band[i,:]. What's the purpose of that?
As for your subject line, Storing multiple arrays within multiple arrays within an array - that sounds like making a 3d or 4d array. arrays is 3d, NIR_values is 2d.
Creating a (90,1699,1699) from a (93,1699) will probably involve (without iteration) a calculation analogous to:
In [574]: X = np.arange(13*4).reshape(13,4)
In [575]: X.shape
Out[575]: (13, 4)
In [576]: (X[3:,:,None]+X[3:,None,:]).shape
Out[576]: (10, 4, 4)
The last dimension is expanded with None (np.newaxis), and the 2 versions are broadcast against each other. np.outer is the multiplication analog of this (addition-based) calculation.

Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?

I want to iteratively build sparse matrices, and noticed that there are two suitable options for this according to the SciPy documentation:
LiL matrix:
class scipy.sparse.lil_matrix(arg1, shape=None, dtype=None, copy=False)
Row-based linked list sparse matrix
This is an efficient structure for constructing sparse matrices incrementally.
DoK matrix:
class scipy.sparse.dok_matrix(arg1, shape=None, dtype=None, copy=False)
Dictionary Of Keys based sparse matrix.
This is an efficient structure for constructing sparse matrices incrementally.
But when I run benchmarks comparing these to building a dictionary of dictionaries of values (which can later easily be converted to a sparse matrix), the latter turns out to be about 10-20 times faster than using either of the sparse matrix models:
from scipy.sparse import dok_matrix, lil_matrix
from timeit import timeit
from collections import defaultdict

def common_dict(rows, cols):
    freqs = defaultdict(lambda: defaultdict(int))
    for row, col in zip(rows, cols):
        freqs[row][col] += 1
    return freqs

def dok(rows, cols):
    freqs = dok_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        freqs[row, col] += 1
    return freqs

def lil(rows, cols):
    freqs = lil_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        freqs[row, col] += 1
    return freqs

def benchmark():
    cols = range(1000)
    rows = range(1000)
    res = timeit("common_dict({},{})".format(rows, cols),
                 "from __main__ import common_dict",
                 number=100)
    print("common_dict: {}".format(res))
    res = timeit("dok({},{})".format(rows, cols),
                 "from __main__ import dok",
                 number=100)
    print("dok: {}".format(res))
    res = timeit("lil({},{})".format(rows, cols),
                 "from __main__ import lil",
                 number=100)
    print("lil: {}".format(res))
Results:
benchmark()
common_dict: 0.11778324202168733
dok: 2.2927695910912007
lil: 1.3541790939634666
What is it that causes such overhead for the matrix models, and is there some way to speed it up? Are there use cases where either dok or lil is to be preferred over a common dict of dicts?
When I change your += to just = for your 2 sparse arrays:
for row, col in zip(rows, cols):
    # freqs[row,col] += 1
    freqs[row, col] = 1
their respective times are cut in half. What consumes the most time is the indexing. With += it has to do both a __getitem__ and a __setitem__.
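Roughly, the in-place update expands into two separate indexing calls; a small sketch with dok_matrix:
from scipy.sparse import dok_matrix

freqs = dok_matrix((1000, 1000))
row, col = 5, 7
# what freqs[row, col] += 1 effectively does:
tmp = freqs.__getitem__((row, col))     # full sparse __getitem__ machinery
freqs.__setitem__((row, col), tmp + 1)  # followed by full __setitem__ machinery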
When the docs say that dok and lil are better for iterative construction they mean that it's easier to expand their underlying data structures than for the other formats.
When I try to make a csr matrix with your code, I get a:
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
and the code runs about 30x slower.
So the speed claims are relative to formats like csr, not relative to pure Python or numpy structures.
You might want to look at the Python code for dok_matrix.__getitem__ and dok_matrix.__setitem__ to see what happens when you do freq[r,c].
A faster way to construct your dok would be:
freqs = dok_matrix((1000, 1000))
d = dict()
for row, col in zip(rows, cols):
    d[(row, col)] = 1
freqs.update(d)
taking advantage of the fact that a dok is a subclassed dictionary. Note that a dok matrix is not a dictionary of dictionaries; its keys are tuples like (50, 50).
Another fast way of constructing the same sparse array is:
freqs = sparse.coo_matrix((np.ones(1000,int),(rows,cols)))
In other words, since you already have the rows and cols arrays (or ranges), calculate the corresponding data array, and THEN construct the sparse array.
But if you must perform sparse operations on your matrix between incremental growth steps, then dok or lil may be your best choices.
Sparse matrices were developed for linear algebra problems, such as solving a linear equation with a large sparse matrix. I used them years ago in MATLAB to solve finite difference problems. For that work the calculation-friendly csr format is the ultimate goal, and the coo format was a convenient initialization format.
Now many of the SO scipy sparse questions arise from scikit-learn and text analysis problems. They are also used in biological database files. But still, the (data), (row, col) definition method works best.
So sparse matrices were never intended for fast incremental creation. The traditional Python structures like dictionaries and lists are much better for that.
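For completeness, here is a sketch of converting such a dict of dicts into a sparse matrix afterwards (the helper name is mine):
from scipy import sparse

def dict_of_dicts_to_csr(freqs, shape):
    # flatten a {row: {col: value}} structure into (data, (rows, cols))
    rows, cols, data = [], [], []
    for r, row_dict in freqs.items():
        for c, v in row_dict.items():
            rows.append(r)
            cols.append(c)
            data.append(v)
    return sparse.coo_matrix((data, (rows, cols)), shape=shape).tocsr()

# e.g. dict_of_dicts_to_csr(common_dict(range(1000), range(1000)), (1000, 1000))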
Here's a faster dok iteration that takes advantage of its dictionary methods. update seems to work as fast as on a plain dictionary. get is about 3x faster than the equivalent indexing (freq[row,col]). Indexing probably uses get, but must have a lot of overhead.
def fast_dok(rows, cols):
    freqs = dok_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        i = freqs.get((row, col), 0)
        freqs.update({(row, col): i + 1})
    return freqs
Skipping the get, and just doing
freqs.update({(row, col): 1})
is even faster - faster than the defaultdict of defaultdict example, and nearly as fast as a simple dictionary initialization ({(r, c): 1 for r, c in zip(rows, cols)}).
There are various reasons why your test is not fair. Firstly, you're including the overhead of constructing the sparse matrices as part of your timed loop.
Secondly, and arguably more importantly, you should use data structures as they are designed to be used, with operations on the whole array at once. That is, rather than iterating over the rows and columns and adding 1 each time, simply add 1 to the whole array.
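A sketch of that whole-array style with numpy (dense here, purely for illustration):
import numpy as np

rows = np.arange(1000)
cols = np.arange(1000)

freqs = np.zeros((1000, 1000))
# one vectorized call instead of a Python-level loop over (row, col) pairs;
# np.add.at also accumulates correctly when index pairs repeat
np.add.at(freqs, (rows, cols), 1)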

Select specific rows from a 2d Numpy array using a sparse binary 1-d array

I am having issues figuring out how to do this operation.
I have a variable index, a 1xN sparse binary array, and a 2-d array samples of shape NxM. I want to use index to select specific rows of samples and get a 2-d array back.
I have tried stuff like:
idx = index.todense() == 1
samples[idx.T,:]
but nothing.
So far I have made it work doing this:
idx = test_x.todense() == 1
selected_samples = samples[np.array(idx.flat)]
But there should be a cleaner way.
To give an idea, using a fraction of the data:
print(idx.shape)      # (1, 22360)
print(samples.shape)  # (22360, 200)
The short answer:
selected_samples = samples[index.nonzero()[1]]
The long answer:
The first problem is that your index matrix is 1xN (2-d) while your samples ndarray is NxM, so the index can't be used directly as a boolean mask over the rows. (See the mismatch?) This is why you needed to call .flat.
That's not a big deal, though, because we just need the nonzero entries in the sparse vector. Get those with index.nonzero(), which returns a tuple of (row indices, column indices). We only care about the column indices, so we use index.nonzero()[1] to get those by themselves.
Then, simply index with the array of nonzero column indices and you're done.
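A toy check of that, with made-up shapes:
import numpy as np
from scipy import sparse

samples = np.arange(24).reshape(6, 4)                      # NxM dense samples
index = sparse.csr_matrix(np.array([[0, 1, 0, 1, 1, 0]]))  # 1xN binary mask

selected_samples = samples[index.nonzero()[1]]             # rows 1, 3 and 4
print(selected_samples.shape)                              # (3, 4)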
