Vacuum HDF5 dataset (to remove rows of data and resize) - python

Let's say I have an HDF5 dataset with maxshape=(None,1000) and chunks=(1,1000).
Then whenever I need to delete a row I just zero it (there may be many such rows):
ds[ix,:] = 0
What is the fastest way to vacuum the zeroed rows and resize the dataset?
Now let's add a twist. I have a dict that resolves symbol names to dataset row indices:
{ name : ds_ix }
What is the fastest way to vacuum and keep the correct ds_ix values?

Did you mean resize the dataset when you asked 'resize the array'? (Also, I assume you meant maxshape=(None,1000).) If so, use the .resize() method. However, if you aren't removing the last row(s), you will have to rearrange the non-zero data, then resize. (And you really don't need to zero out the row(s), since you are going to overwrite them.)
I can think of 2 approaches to rearrange the data: 1) use slice notation to define FROM and TO indices, or 2) read the dataset into a numpy array, delete the rows, and copy it back. Both involve disk I/O so it's not clear which would be faster without testing. It probably doesn't matter for small datasets and only a few deleted rows. I suspect the second method will be better if you plan to delete a lot of rows from large datasets. However, benchmark tests are required to confirm.
Note: be careful setting the chunk size. Remember that this controls the I/O size, and you will be doing a lot of I/O when you move rows. Setting it too small (or too large) can degrade performance. Setting it to (1,1000) is probably too small. The recommended chunk size is 10 KiB to 1 MiB; a (1,1000) float32 chunk is only 4 KiB.
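For example, a minimal sketch (the file name, dataset name, and sizes are placeholders, not from the original answer) of creating the dataset with chunks that span multiple rows so each chunk lands in the recommended range:
import h5py
import numpy as np

with h5py.File('example.h5', 'w') as h5f:
    # 64 rows x 1000 float32 columns per chunk = 64*1000*4 bytes = 250 KiB
    ds = h5f.create_dataset('test', shape=(100, 1000), dtype='f4',
                            maxshape=(None, 1000), chunks=(64, 1000))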
Here are both approaches with a very small dataset.
Create an HDF5 file:
import h5py
import numpy as np

with h5py.File('SO_73353006.h5','w') as h5f:
    a0, a1 = 10, 5
    arr = np.arange(a0*a1).reshape(a0,a1)
    ds = h5f.create_dataset('test', data=arr, maxshape=(None,a1))
Method 1: move data, then resize dataset
with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    #ds[idx,:] = 0  # Not required since we will overwrite the row
    a0 = ds.shape[0]
    ds[idx:a0-1] = ds[idx+1:a0]
    ds.resize(a0-1, axis=0)
Method 2: extract array, delete row and copy data to resized dataset
with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    a0 = ds.shape[0]
    a1 = ds.shape[1]
    # read dataset into array and delete row
    ds_arr = ds[()]
    ds_arr = np.delete(ds_arr, obj=idx, axis=0)
    # Resize dataset and load array
    ds.resize(a0-1, axis=0)  # same as above
    ds[:] = ds_arr[:]
    # Create a new dataset for comparison
    ds2 = h5f.create_dataset('test2', data=ds_arr, maxshape=(None,a1))
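As for the second part of the question (keeping the { name : ds_ix } dict correct), the answer above does not address it, but a minimal sketch could look like this. Here symbols is the { name : ds_ix } dict and deleted_rows is a sorted list of the row indices you removed; both names are hypothetical:
import bisect

def remap_indices(symbols, deleted_rows):
    # deleted_rows must be sorted; drop deleted names and shift the rest down
    deleted = set(deleted_rows)
    new_symbols = {}
    for name, ds_ix in symbols.items():
        if ds_ix in deleted:
            continue  # the row for this symbol was removed
        # each deleted row below ds_ix shifts it down by one
        shift = bisect.bisect_left(deleted_rows, ds_ix)
        new_symbols[name] = ds_ix - shift
    return new_symbols

# usage: symbols = remap_indices(symbols, sorted([5, 17, 42]))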


How do I aggregate 50 datasets within an HDF5 file

I have an HDF5 file with 2 groups, each containing 50 datasets of 4D numpy arrays of the same type. I want to combine all 50 datasets in each group into a single dataset. In other words, instead of 2 x 50 datasets I want 2 x 1 datasets. How can I accomplish this? The file is 18.4 GB in size. I am a novice at working with large datasets. I am working in Python with h5py.
Thanks!
Look at this answer: How can I combine multiple .h5 file? - Method 3b: Merge all data into 1 Resizeable Dataset. It describes a way to copy data from multiple HDF5 files into a single dataset. You want to do something similar. The only difference is all of your datasets are in 1 HDF5 file.
You didn't say how you want to stack the 4D arrays. In my first answer I stacked them along axis=3. As noted in my comment, it's easier (and cleaner) to create the merged dataset as a 5D array and stack the data along the 5th axis (axis=4). I like this for 2 reasons: 1) the code is simpler/easier to follow, and 2) it's more intuitive (to me) that axis=4 represents a unique dataset (instead of slicing on axis=3).
I wrote a self-contained example to demonstrate the procedure. First it creates some data and closes the file. Then it reopens the file (read only) and creates a new file for the copied datasets. It loops over the groups and datasets in the first file and copies the data into a merged dataset in the second file. The 5D example is first, and my original 4D example follows.
Note: this is a simple example that will work for your specific case. If you are writing a general solution, it should check for consistent shapes and dtypes before blindly merging the data (which I don't do).
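As an aside, a minimal sketch of such a check (check_consistent is a hypothetical helper, not part of the original answer) that could be called on each group before creating merged_ds:
def check_consistent(group):
    # verify all datasets in an h5py group share one shape and dtype before merging
    shapes = {dset.shape for dset in group.values()}
    dtypes = {dset.dtype for dset in group.values()}
    if len(shapes) != 1 or len(dtypes) != 1:
        raise ValueError(f'inconsistent datasets in {group.name}: '
                         f'shapes={shapes}, dtypes={dtypes}')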
Code to create the Example data (2 groups, 5 datasets each):
import h5py
import numpy as np

# Create a simple H5 file with 2 groups and 5 datasets (shape=(a0,a1,a2,a3))
with h5py.File('SO_69937402_2x5.h5','w') as h5f1:
    a0, a1, a2, a3 = 100, 20, 20, 10
    grp1 = h5f1.create_group('group1')
    for ds in range(1,6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0,a1,a2,a3)
        grp1.create_dataset(f'dset_{ds:02d}', data=arr)
    grp2 = h5f1.create_group('group2')
    for ds in range(1,6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0,a1,a2,a3)
        grp2.create_dataset(f'dset_{ds:02d}', data=arr)
Code to merge the data (2 groups, 1 5D dataset each -- my preference):
with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1_5d.h5','w') as h5f2:
    # loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:', grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        ds_cnt = len(h5f1[grp].keys())
        for i, ds in enumerate(h5f1[grp].keys()):
            print('working on dataset:', ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # If dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', dtype=h5f1[grp][ds].dtype,
                                                    shape=(ds_shape+(ds_cnt,)), maxshape=(ds_shape+(None,)))
            # Now add data to the merged dataset
            merge_ds[:,:,:,:,i] = h5f1[grp][ds]
Code to merge the data (2 groups, 1 4D dataset each):
with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1_4d.h5','w') as h5f2:
    # loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:', grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        for ds in h5f1[grp].keys():
            print('working on dataset:', ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # if dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', data=h5f1[grp][ds],
                                                    maxshape=[ds_shape[0], ds_shape[1], ds_shape[2], None])
            else:
                # otherwise, resize the merged dataset to hold the new values
                ds1_shape = h5f1[grp][ds].shape
                ds2_shape = merge_ds.shape
                merge_ds.resize(ds1_shape[3]+ds2_shape[3], axis=3)
                merge_ds[:,:,:, ds2_shape[3]:ds2_shape[3]+ds1_shape[3]] = h5f1[grp][ds]
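To sanity-check the merge, a short hedged example (reusing the file and dataset names above, and assuming h5py and numpy are imported as in the first snippet) reopens both files and compares each original dataset against its slice of the 5D merged dataset:
with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1_5d.h5','r') as h5f2:
    for grp in h5f1.keys():
        merged = h5f2[grp]['merged_ds']
        for i, ds in enumerate(h5f1[grp].keys()):
            # each original 4D dataset should match one slice along axis=4
            assert np.array_equal(h5f1[grp][ds][()], merged[:,:,:,:,i])
print('merged data matches the originals')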

Out-of-core processing of sparse CSR arrays

How can one apply some function in parallel on chunks of a sparse CSR array saved on disk using Python? Sequentially this could be done e.g. by saving the CSR array with joblib.dump, opening it with joblib.load(.., mmap_mode="r"), and processing the chunks of rows one by one. Could this be done more efficiently with dask?
In particular, assuming one doesn't need all the possible out-of-core operations on sparse arrays, but just needs the ability to load row chunks in parallel (each chunk is a CSR array) and apply some function to them (in my case it would be e.g. estimator.predict(X) from scikit-learn).
Besides, is there a file format on disk that would be suitable for this task? Joblib works, but I'm not sure about the (parallel) performance of CSR arrays loaded as memory maps; spark.mllib appears to use either some custom sparse storage format (that doesn't seem to have a pure Python parser) or the LIBSVM format (the parser in scikit-learn is, in my experience, much slower than joblib.dump)...
Note: I have read documentation, various issues about it on https://github.com/dask/dask/ but I'm still not sure how to best approach this problem.
Edit: to give a more practical example, below is the code that works in dask for dense arrays but fails when using sparse arrays with this error:
import numpy as np
import scipy.sparse
import joblib
import dask.array as da
from sklearn.utils import gen_batches
np.random.seed(42)
joblib.dump(np.random.rand(100000, 1000), 'X_dense.pkl')
joblib.dump(scipy.sparse.random(10000, 1000000, format='csr'), 'X_csr.pkl')
fh = joblib.load('X_dense.pkl', mmap_mode='r')
batch_size = 2000  # batch size was not defined in the original snippet; assumed here
# computing the results without dask
results = np.vstack([fh[sl, :].sum(axis=1) for sl in gen_batches(fh.shape[0], batch_size)])
# computing the results with dask
x = da.from_array(fh, chunks=(2000))
results = x.sum(axis=1).compute()
Edit 2: following the discussion below, the example below overcomes the previous error but gets one about IndexError: tuple index out of range in dask/array/core.py:L3413:
import dask
# +imports from the example above
dask.set_options(get=dask.get)  # disable multiprocessing
fh = joblib.load('X_csr.pkl', mmap_mode='r')

def func(x):
    if x.ndim == 0:
        # dask does some heuristics with dummy data; if x is a 0d array
        # the sum command would fail
        return x
    res = np.asarray(x.sum(axis=1, keepdims=True))
    return res

Xd = da.from_array(fh, chunks=(2000))
results_new = Xd.map_blocks(func).compute()
So I don't know anything about joblib or dask, let alone your application specific data format. But it is actually possible to read sparse matrices from disk in chunks while retaining the sparse data structure.
While the Wikipedia article for the CSR format does a great job explaining how it works, I'll give a short recap:
Some sparse Matrix, e.g.:
1 0 2
0 0 3
4 5 6
is stored by remembering each nonzero value and the column it resides in:
sparse.data = 1 2 3 4 5 6 # actual values
sparse.indices = 0 2 2 0 1 2 # column index (0-indexed)
Now we are still missing the rows. The compressed format just stores how many non-zero values there are in each row, instead of storing every single value's row.
Note that the non-zero count is also accumulated, so the following array contains the number of non-zero values up until and including this row. To complicate things even further, the array always starts with a 0 and thus contains num_rows+1 entries:
sparse.indptr = 0 2 3 6
So up until and including the second row there are 3 nonzero values, namely 1, 2 and 3.
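As a quick sanity check of that layout (a minimal example added here, not part of the original answer), you can build exactly that matrix with scipy and inspect the three arrays:
import numpy as np
import scipy.sparse

dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 6]])
m = scipy.sparse.csr_matrix(dense)
print(m.data)     # [1 2 3 4 5 6]
print(m.indices)  # [0 2 2 0 1 2]
print(m.indptr)   # [0 2 3 6]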
Since we got this sorted out, we can start 'slicing' the matrix. The goal is to construct the data, indices and indptr arrays for some chunks. Assume the original huge matrix is stored in three binary files, which we will incrementally read. We use a generator to repeatedly yield some chunk.
For this we need to know how many non-zero values are in each chunk, and read the according amount of values and column-indices. The non-zero count can be conveniently read from the indptr array. This is achieved by reading some amount of entries from the huge indptr file that corresponds to the desired chunk size. The last entry of that portion of the indptr file minus the number of non-zero values before gives the number of non-zeros in that chunk. So the chunks data and indices arrays are just sliced from the big data and indices files. The indptr array needs to be prepended artificially with a zero (that's what the format wants, don't ask me :D).
Then we can just construct a sparse matrix with the chunk data, indices and indptr to get a new sparse matrix.
It has to be noted that the actual matrix size cannot be directly reconstructed from the three arrays alone: the best you can infer is the maximum column index that appears in the chunk, and if you are unlucky and the chunk contains no data at all it is undetermined. So we also need to pass the column count.
I probably explained things in a rather complicated way, so just read this as an opaque piece of code that implements such a generator:
import numpy as np
import scipy.sparse

def gen_batches(batch_size, sparse_data_path, sparse_indices_path,
                sparse_indptr_path, dtype=np.float32, column_size=None):
    data_item_size = dtype().itemsize
    with open(sparse_data_path, 'rb') as data_file, \
         open(sparse_indices_path, 'rb') as indices_file, \
         open(sparse_indptr_path, 'rb') as indptr_file:
        nnz_before = np.fromstring(indptr_file.read(4), dtype=np.int32)
        while True:
            indptr_batch = np.frombuffer(nnz_before.tobytes() +
                                         indptr_file.read(4*batch_size), dtype=np.int32)
            if len(indptr_batch) == 1:
                break
            batch_indptr = indptr_batch - nnz_before
            nnz_before = indptr_batch[-1]
            batch_nnz = np.asscalar(batch_indptr[-1])
            batch_data = np.frombuffer(data_file.read(
                data_item_size * batch_nnz), dtype=dtype)
            batch_indices = np.frombuffer(indices_file.read(
                4 * batch_nnz), dtype=np.int32)
            dimensions = (len(indptr_batch)-1, column_size)
            matrix = scipy.sparse.csr_matrix((batch_data,
                batch_indices, batch_indptr), shape=dimensions)
            yield matrix

if __name__ == '__main__':
    sparse = scipy.sparse.random(5, 4, density=0.1, format='csr', dtype=np.float32)
    sparse.data.tofile('sparse.data')        # dtype as specified above ^^^^^^^^^^
    sparse.indices.tofile('sparse.indices')  # dtype=int32
    sparse.indptr.tofile('sparse.indptr')    # dtype=int32
    print(sparse.toarray())
    print('========')
    for batch in gen_batches(2, 'sparse.data', 'sparse.indices',
                             'sparse.indptr', column_size=4):
        print(batch.toarray())
numpy.ndarray.tofile() just stores the raw binary array, so you need to remember the data format. scipy.sparse represents the indices and indptr as int32, so that's a limitation for the total matrix size.
Also, I benchmarked the code and found that the scipy CSR matrix constructor is the bottleneck for small matrices. Your mileage may vary though; this is just a 'proof of principle'.
If there is need for a more sophisticated implementation, or something is too obscure, just hit me up :)
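To connect this back to the original question (applying something like estimator.predict to each chunk in parallel), a minimal sketch could feed the generator into joblib.Parallel. The func below, the file names, and the sizes are placeholders; this is just one possible way to wire it up:
from joblib import Parallel, delayed

def func(chunk):
    # stand-in for e.g. estimator.predict(chunk)
    return chunk.sum(axis=1)

batches = gen_batches(10000, 'sparse.data', 'sparse.indices',
                      'sparse.indptr', column_size=1000000)
results = Parallel(n_jobs=4)(delayed(func)(chunk) for chunk in batches)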

h5py cannot convert element 0 to hsize_t

I have a boatload of images in an HDF5 file that I would like to load and analyse. Each image is 1920x1920 uint16 and loading all of them into memory crashes the computer. I have been told that others work around that by slicing the image, e.g. if the data is 1920x1920x100 (100 images) then they read in the first 80 rows of each image and analyse that slice, then move to the next slice. This I can do without problems, but when I try to create a dataset in the HDF5 file, I get a TypeError: Can't convert element 0 ... to hsize_t
I can recreate the problem with this very simplified code:
import h5py
import numpy as np

with h5py.File('h5file.hdf5','w') as f:
    data = np.random.randint(100, size=(15,15,20))
    data_set = f.create_dataset('data', data, dtype='uint16')
which gives the output:
TypeError: Can't convert element 0 ([[29 50 75...4 50 28 36 13 72]]) to hsize_t
I have also tried omitting the "data_set =" and the "dtype='uint16'", but I still get the same error. The code is then:
with h5py.File('h5file.hdf5','w') as f:
    data = np.random.randint(100, size=(15,15,20))
    f.create_dataset('data', data)
Can anyone give me any hints to what the problem is?
Cheers!
The second parameter of create_dataset is the shape parameter (see the docs), but you pass the entire array. If you want to initialize the dataset with an existing array, you must specify this with the data keyword, like this:
data_set = f.create_dataset('data', data=data, dtype="uint16")
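Alternatively, a short hedged sketch of the other route the positional arguments support (assuming the same imports and file as above): create an empty uint16 dataset by giving shape and dtype explicitly, then write the array, or slices of it, afterwards:
with h5py.File('h5file.hdf5','w') as f:
    data = np.random.randint(100, size=(15,15,20))
    data_set = f.create_dataset('data', (15,15,20), dtype='uint16')
    data_set[...] = data   # write everything at once
    # or slice by slice, e.g. data_set[0:8, :, :] = data[0:8, :, :]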

Storing multiple arrays within multiple arrays within an array Python/Numpy

I have a text file with 93 columns and 1699 rows that I have imported into Python. The first three columns do not contain data that is necessary for what I'm currently trying to do. Within each column, I need to divide each element (aka row) in the column by all of the other elements (rows) in that same column. The result I want is an array of 90 elements where each of 1699 elements has 1699 elements.
A more detailed description of what I'm attempting: I begin with Column3. At Column3, Row1 is to be divided by all the other rows (including the value in Row1) within Column3. That will give Row1 1699 calculations. Then the same process is done for Row2 and so on until Row1699. This gives Column3 1699x1699 calculations. When the calculations of all of the rows in Column 3 have completed, then the program moves on to do the same thing in Column 4 for all of the rows. This is done for all 90 columns which means that for the end result, I should have 90x1699x1699 calculations.
My code as it currently stands is:
import numpy as np
from glob import glob

fnames = glob("NIR_data.txt")
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
NIR_band = NIR_values.T
C_values = []
for i in range(3, len(NIR_band)):
    for j in range(0, len(NIR_band[3])):
        loop_list = NIR_band[i][j]/NIR_band[i,:]
        C_values.append(loop_list)
What it produces is an array of 1699x1699 dimension. Each individual array is the results from the Row calculations. Another complaint is that the code takes ages to run. So, I have two questions, is it possible to create the type of array I'd like to work with? And, is there a faster way of coding this calculation?
Dividing each of the numbers in a given column by each of the other values in the same column can be accomplished in one operation as follows.
result = a[:, numpy.newaxis, :] / a[numpy.newaxis, :, :]
Because looping over the elements happens in the optimized binary depths of numpy, this is as fast as Python is ever going to get for this operation.
If a.shape was [1699,90] to begin with, then the result will have shape [1699,1699,90]. Assuming dtype=float64, that means you will need nearly 2 GB of memory available to store the result.
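A small demonstration of that one-liner with made-up data (the random array here is just a stand-in, assuming 1699 rows and the 90 relevant columns):
import numpy as np

a = np.random.rand(1699, 90)          # rows x columns, after dropping the first 3 columns
result = a[:, np.newaxis, :] / a[np.newaxis, :, :]
print(result.shape)                   # (1699, 1699, 90)
# result[i, j, k] == a[i, k] / a[j, k], i.e. every row divided by every other row, per column
print(result.nbytes / 1e9, 'GB')      # roughly 2 GB for float64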
First let's focus on the load:
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
Your text talks about loading a file and manipulating it, but this clip loads multiple files and joins them.
My first change is to collect the arrays in a list, not another array:
alist = [np.loadtxt(f, skiprows=1) for f in fnames]
If you want to skip some columns, look at using the usecols parameter. That may save you work later.
The elements of alist will now be 2d arrays (of floats). If they are matching sizes (N,M), they can be joined in various ways. If there are n files, then
arrays = np.array(alist) # (n,N,M) array
arrays = np.concatenate(alist, axis=0) # (n*N, M) array
# similarly for axis=1
Your code does the same, but potentially confuses steps:
In [566]: arrays = np.array([np.ones((3,4)) for i in range(5)])
In [567]: arrays.shape
Out[567]: (5, 3, 4) # (n,N,M) array
In [568]: NIR_values = np.concatenate(arrays)
In [569]: NIR_values.shape
Out[569]: (15, 4) # (n*N, M) array
NIR_band is now (4,15), and its len() is .shape[0], the size of the 1st dimension. len(NIR_band[3]) is .shape[1], the size of the 2nd dimension.
You could skip the columns of NIR_values with NIR_values[:,3:].
I get lost in the rest of the text description.
The NIR_band[i][j]/NIR_band[i,:], I would rewrite as NIR_band[i,j]/NIR_band[i,:]. What's the purpose of that?
As for your subject line, "Storing multiple arrays within multiple arrays within an array" - that sounds like making a 3d or 4d array. arrays is 3d, NIR_values is 2d.
Creating a (90,1699,1699) from a (93,1699) will probably involve (without iteration) a calculation analogous to:
In [574]: X = np.arange(13*4).reshape(13,4)
In [575]: X.shape
Out[575]: (13, 4)
In [576]: (X[3:,:,None]+X[3:,None,:]).shape
Out[576]: (10, 4, 4)
A size-1 dimension is added with None (np.newaxis), placed last in one operand and in the middle of the other, and the two versions are broadcast against each other. This is the same idea as the outer product that np.outer computes for 1d arrays.
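Tying that back to the question (a hedged sketch, assuming NIR_band has shape (93, 1699) as produced by NIR_values.T above), the division version of the same broadcasting trick would be:
# divide every sample by every other sample within each band, skipping the first 3 bands
C = NIR_band[3:, :, None] / NIR_band[3:, None, :]
print(C.shape)   # (90, 1699, 1699); C[b, i, j] == NIR_band[3+b, i] / NIR_band[3+b, j]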

Creation of large pandas DataFrames from Series

I'm dealing with data on a fairly large scale. For reference, a given sample will have ~75,000,000 rows and 15,000-20,000 columns.
As of now, to conserve memory I've taken the approach of creating a list of Series (each column is a series, so ~15K-20K Series each containing ~250K rows). Then I create a SparseDataFrame containing every index within these series (because as you notice, this is a large but not very dense dataset). The issue is this becomes extremely slow, and appending each column to the dataset takes several minutes. To overcome this I've tried batching the merges as well (select a subset of the data, merge these to a DataFrame, which is then merged into my main DataFrame), but this approach is still too slow. Slow meaning it only processed ~4000 columns in a day, with each append causing subsequent appends to take longer as well.
One part which struck me as odd is why my column count of the main DataFrame affects the append speed. Because my main index already contains all entries it will ever see, I shouldn't have to lose time due to re-indexing.
In any case, here is my code:
import time
import sys
import numpy as np
import pandas as pd

precision = 6
df = []
for index, i in enumerate(raw):
    if i is None:
        break
    if index%1000 == 0:
        sys.stderr.write('Processed %s...\n' % index)
    df.append(pd.Series(dict([(np.round(mz, precision), int(intensity)) for mz, intensity in i.scans]), dtype='uint16', name=i.rt))
all_indices = set([])
for j in df:
    all_indices |= set(j.index.tolist())
print len(all_indices)
t = time.time()
main_df = pd.DataFrame(index=all_indices)
first = True
del all_indices
while df:
    subset = [df.pop() for i in xrange(10) if df]
    all_indices = set([])
    for j in subset:
        all_indices |= set(j.index.tolist())
    df2 = pd.DataFrame(index=all_indices)
    df2.sort_index(inplace=True, axis=0)
    df2.sort_index(inplace=True, axis=1)
    del all_indices
    ind = 0
    while subset:
        t2 = time.time()
        ind += 1
        arr = subset.pop()
        df2[arr.name] = arr
        print ind, time.time()-t, time.time()-t2
    df2.reindex(main_df.index)
    t2 = time.time()
    for i in df2.columns:
        main_df[i] = df2[i]
    if first:
        main_df = main_df.to_sparse()
        first = False
    print 'join time', time.time()-t, time.time()-t2
    print len(df), 'entries remain'
Any advice on how I can load this large dataset quickly is appreciated, even if it means writing it to disk to some other format first/etc.
Some additional info:
1) Because of the number of columns, I can't use most traditional on-disk stores such as HDF.
2) The data will be queried across columns and rows when it is in use. So main_df.loc[row:row_end, col:col_end]. These aren't predictable block sizes so chunking isn't really an option. These lookups also need to be fast, on the order of ~10 a second to be realistically useful.
3) I have 32G of memory, so a SparseDataFrame I think is the best option since it fits in memory and allows fast lookups as needed. Just the creation of it is a pain at the moment.
Update:
I ended up using scipy sparse matrices and handling the indexing on my own for the time being. This results in appends at a constant rate of ~0.2 seconds, which is acceptable (versus Pandas taking ~150 seconds per append for my full dataset). I'd love to know how to make Pandas match this speed.
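For anyone curious what that looks like, here is a minimal hedged sketch of the same idea (the helper names and the label-to-position bookkeeping are assumptions, not the poster's actual code): collect coordinates plus dicts mapping labels to integer positions, then build one CSR matrix for fast row/column slicing.
import numpy as np
import scipy.sparse as sp

# label -> integer position bookkeeping (hypothetical names)
row_index = {}      # e.g. m/z value -> row number
col_index = {}      # e.g. retention time -> column number
rows, cols, vals = [], [], []

def add_column(col_label, series_dict):
    # series_dict maps row label -> value for one column
    c = col_index.setdefault(col_label, len(col_index))
    for row_label, value in series_dict.items():
        r = row_index.setdefault(row_label, len(row_index))
        rows.append(r); cols.append(c); vals.append(value)

# call add_column() once per column, then build the matrix once:
def build_matrix():
    m = sp.coo_matrix((vals, (rows, cols)),
                      shape=(len(row_index), len(col_index)), dtype=np.float32)
    return m.tocsr()   # CSR supports fast slicing, e.g. m[r0:r1, c0:c1]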
