Python h5py - efficient access of arrays of ragged arrays


I have a large h5py file with several ragged arrays in a large dataset. The arrays have one of the following types:
# Create types of lists of variable length vectors
vardoub = h5py.special_dtype(vlen=np.dtype('double'))
varint = h5py.special_dtype(vlen=np.dtype('int8'))
Within an HDF5 group (grp), I create datasets of N jagged items, e.g.:
d = grp.create_dataset("predictions", (N,), dtype=vardoub)
and populate d[0], d[1], ..., d[N-1] with long numpy arrays (usually hundreds of millions of elements each).
Creating these arrays works well; my issue is related to access. If I want to access a slice from one of the arrays, e.g. d[0][5000:6000] or d[0][[50, 89, 100]], the memory usage goes through the roof, and I believe that it is reading in large sections of the array: I can watch physical memory usage rise from 5-6 GB to 32 GB (the size of RAM on the machine) very quickly. p = d[0] reads the whole array into memory, so I think this is what is happening, and only then is the result indexed.
Is there a better way to do this? d[n]'s type is a numpy array and I cannot take a ref of it. I suspect that I could restructure the data so that I have groups for each of the indices e.g. '0/predictions', '1/predictions', ..., but I would prefer not to have to convert this if there is a reasonable alternative.
Thank you,
Marie
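A minimal sketch of the restructuring mentioned in the question, assuming each ragged item is stored as its own chunked 1-D dataset instead of as one variable-length element. The file name, the dataset names ("predictions/0", ...), the lengths and the chunk size are all illustrative assumptions. With a regular dataset, h5py only reads the chunks that a selection touches, whereas a variable-length element is read as a whole.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and sizes, chosen only for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "ragged_demo.h5")
lengths = [20_000, 5_000, 12_345]

with h5py.File(path, "w") as f:
    grp = f.create_group("predictions")
    for n, length in enumerate(lengths):
        data = np.arange(length, dtype=np.float64) + n
        # One ordinary chunked dataset per ragged item.
        grp.create_dataset(str(n), data=data, chunks=(4096,))

with h5py.File(path, "r") as f:
    d0 = f["predictions/0"]          # a Dataset handle, nothing is read yet
    window = d0[5000:6000]           # reads only the chunks covering this slice
    picked = d0[[50, 89, 100]]       # index lists must be in increasing order
    shapes = [f[f"predictions/{n}"].shape for n in range(len(lengths))]
```

The cost is one dataset per item, so this layout suits a moderate N with very long items, which matches the sizes described above.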

Related

efficient iterative creation of multiple numpy arrays at once

I have a file with millions of lines, each of which is a list of integers (these sublists are in the range of tens to hundreds of items). What I want is to read through the file contents once and create 3 numpy arrays -- one with the average of each sublist, one with the length of each sublist, and one which is a flattened list of all the values in all the sublists.
If I just wanted one of these things, I'd do something like:
counts = np.fromiter((len(json.loads(line.rstrip())) for line in mystream), int)
but if I write 3 of those, my code would iterate through my millions of sublists 3 times, and I obviously only want to iterate through them once. So I want to do something like this:
averages = []
counts = []
allvals = []
for line in mystream:
    sublist = json.loads(line.rstrip())
    averages.append(np.average(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
I believe that building regular Python lists as above and then doing
np_averages = np.array(averages)
is very inefficient (basically creating the list twice). What is the right/efficient way to iteratively create a numpy array if it's not practical to use fromiter? Or do I want to create a function that returns the 3 values and do something like a list comprehension over a function with multiple return values, but with fromiter instead of a traditional list comprehension?
Or would it be efficient to create a 2D array of
[[count1, average1, sublist1], [count2, average2, sublist2], ...] and then do additional operations to slice off (and in the 3rd case also flatten) the columns as their own 1D arrays?

First of all, the json library is not the most optimized library for that. You can use the pysimdjson package, based on the optimized simdjson library, to speed up the computation. For small integer lists, it is about twice as fast on my machine.
Moreover, NumPy functions are not great for relatively small arrays, as they introduce a pretty big overhead. For example, np.average takes about 8-10 us on my machine for an array of 20 items. Meanwhile, sum(sublist)/len(sublist) only takes 0.25-0.30 us.
Finally, np.array needs to iterate twice to convert the list into an array because it does not know the type of all objects. You can specify the type to make the conversion faster: np.array(averages, np.float64).
Here is a significantly faster implementation:
import numpy as np
import simdjson
averages = []
counts = []
allvals = []
for line in mystream:
    sublist = simdjson.loads(line.rstrip())
    averages.append(sum(sublist) / len(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
np_averages = np.array(averages, np.float64)
One issue with this implementation is that allvals will contain all the values in the form of a big list of objects. CPython objects are quite big in memory compared to native NumPy integers (especially compared to 32-bit, i.e. 4-byte, integers), since each object usually takes 32 bytes and the reference in the list usually takes 8 bytes (resulting in 40 bytes per item, that is to say 10 times more than NumPy arrays of 32-bit integers). Thus, it may be better to use a native implementation, possibly based on Cython.
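One way to avoid the big list of objects without leaving Python is to accumulate into typed arrays from the standard-library array module, which store raw 8-byte values, and wrap them with np.frombuffer at the end. This is only a sketch under stated assumptions: it uses the standard json module rather than simdjson, the helper name summarize is made up for illustration, every sublist is assumed to be non-empty, and all values are assumed to fit in a signed 64-bit integer.

```python
import io
import json
from array import array

import numpy as np


def summarize(stream):
    """Single pass over the lines; returns (averages, counts, allvals)."""
    averages = array("d")   # C doubles
    counts = array("q")     # signed 64-bit integers
    allvals = array("q")
    for line in stream:
        sublist = json.loads(line)
        counts.append(len(sublist))
        averages.append(sum(sublist) / len(sublist))
        allvals.extend(sublist)
    # frombuffer wraps the existing buffers instead of copying them.
    return (np.frombuffer(averages, dtype=np.float64),
            np.frombuffer(counts, dtype=np.int64),
            np.frombuffer(allvals, dtype=np.int64))


# Small in-memory stand-in for the real file.
mystream = io.StringIO("[1, 2, 3]\n[10, 20]\n[7]\n")
np_averages, np_counts, np_allvals = summarize(mystream)
```

The per-item cost in allvals drops to 8 bytes; whether the loop itself is fast enough still depends on the JSON parsing.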

RAM usage in dealing with numpy arrays and Python lists

I have memory issues and can't understand why. I'm using Google Colab, which gives me 12 GB of RAM and lets me monitor the RAM usage.
I'm reading NumPy arrays from files and appending each array to a list.
database_list = list()
for filename in glob.glob('*.npy'):
    temp_img = np.load(filename)
    temp_img = temp_img.reshape((-1, 64)).astype('float32')
    temp_img = cv2.resize(temp_img, (64, 3072), interpolation=cv2.INTER_LINEAR)
    database_list.append(temp_img)
The code print("INTER_LINEAR: %d bytes" % (sys.getsizeof(database_list))) prints:
INTER_LINEAR: 124920 bytes
It is the same value for arrays reshaped as 64x64, 512x64, 1024x64, 2048x64 and 3072x64. But if I reshape these arrays as 4096x64, I get an error because too much RAM is used.
With arrays of 3072x64 I can see the RAM usage get higher and higher and then go back down.
My final goal is to zero-pad each array to a dimension of 8192x64, but my session crashes before that; that is another problem, though.
How is the RAM used? Why does the list have the same size if the arrays have different dimensions? How is Python loading and manipulating these files, and how does that explain the RAM usage history?
EDIT:
Does then
sizeofelem = database_list[0].nbytes
# all arrays now have the same dimensions MxN, so regardless of their content they should occupy the same memory
total_size = sizeofelem * len(database_list)
work, and does total_size reflect the correct size of the list?

I've got the solution.
First of all, as Dan Mašek pointed out, I was measuring the memory used by the list itself, which is a collection of pointers (roughly speaking). To measure the real memory usage:
print(database_list[0].nbytes * len(database_list) / 1000000, "MB")
where database_list[0].nbytes is reliable because all the elements in database_list have the same size. To be more precise, I should add the array metadata and possibly any data linked to it (if, for example, I'm storing other structures in the array).
To reduce memory usage, I should use the actual type of the data that I'm reading, which holds values in the range 0-65535, so:
database_list = list()
for filename in glob.glob('*.npy'):
    temp_img = np.load(filename)
    temp_img = temp_img.reshape((-1, 64)).astype(np.uint16)
    database_list.append(temp_img)
Moreover, if I do some calculations on the data stored in database_list, for example normalizing the values to the range 0-1 with something like database_list = database_list / 65535.0 (NB: database_list, being a list, does not support that operation; it has to be applied to each array), I should do another cast, because NumPy promotes the result to float64.
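The points above can be checked on a small stand-in for database_list. This sketch uses four zero-filled arrays instead of the real .npy files, which is an assumption made only to keep it self-contained.

```python
import sys

import numpy as np

# Stand-in for the arrays loaded from disk: four 3072x64 float32 images.
database_list = [np.zeros((3072, 64), dtype=np.float32) for _ in range(4)]

# getsizeof only counts the list object (its pointers), not the arrays.
list_only = sys.getsizeof(database_list)

# nbytes counts the actual array data.
payload = sum(a.nbytes for a in database_list)

# Storing 0-65535 values as uint16 halves the payload.
as_uint16 = [a.astype(np.uint16) for a in database_list]
payload_uint16 = sum(a.nbytes for a in as_uint16)

# Dividing by a Python float promotes to float64, so cast explicitly
# if float32 is enough.
normalized = as_uint16[0] / 65535.0
normalized32 = as_uint16[0].astype(np.float32) / np.float32(65535.0)
```

Here payload is 3072 * 64 * 4 bytes * 4 arrays = 3,145,728 bytes, while list_only stays at a few dozen bytes regardless of the array shapes.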

Large matrix multiplication in Python - what is the best option?

I have two boolean sparse square matrices of c. 80,000 x 80,000 generated from 12 MB of data (and am likely to have orders of magnitude larger matrices when I use GBs of data).
I want to multiply them (which produces a triangular matrix; however, I don't get this since I don't limit the dot product to yield a triangular matrix).
I am wondering what the best way of multiplying them is (memory-wise and speed-wise). I am going to do the computation on an m2.4xlarge AWS instance, which has >60 GB of RAM. I would prefer to keep the calculation in RAM for speed reasons.
I appreciate that SciPy has sparse matrices and so does h5py, but I have no experience in either.
What's the best option to go for?
Thanks in advance
UPDATE: sparsity of the boolean matrices is <0.6%

If your matrices are relatively empty, it might be worthwhile encoding them as a data structure of the non-False values: say, a list of tuples describing the locations of the non-False values, or a dictionary with the tuples as the keys.
If you use e.g. a list of tuples, you could use a list comprehension to find the items in the second list that can be multiplied with an element from the first list.
a = [(0,0), (3,7), (5,2)] # et cetera
b = ... # idem
res = []
for r, c in a:
    # entry (r, c) of a pairs with every entry (j, k) of b whose row j equals c
    res.extend([(r, k) for j, k in b if j == c])
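Scanning all of b for every element of a is quadratic in the number of non-False entries. A hedged sketch of the same idea with b indexed by row first, so each element of a only visits the entries it can actually pair with; the function name bool_matmul and the sample data for b are made up for illustration.

```python
from collections import defaultdict


def bool_matmul(a, b):
    """Boolean product of two matrices given as lists of (row, col) tuples."""
    # Group the column indices of b by row for direct lookup.
    b_by_row = defaultdict(list)
    for j, k in b:
        b_by_row[j].append(k)

    result = set()
    for r, c in a:
        # a[r, c] is True, so every True b[c, k] makes result[r, k] True.
        for k in b_by_row.get(c, ()):
            result.add((r, k))
    return result


a = [(0, 0), (3, 7), (5, 2)]
b = [(0, 4), (7, 1), (7, 3), (6, 6)]
product = bool_matmul(a, b)
```

Using a set for the result removes duplicates, which is what a boolean (OR of ANDs) product needs.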

-- EDITED TO SATISFY BELOW COMMENT / DOWNVOTER --
You're asking how to multiply matrices fast and easy.
SOLUTION 1: This is a solved problem: use numpy. All these operations are easy in numpy, and since they are implemented in C, are rather blazingly fast.
http://www.numpy.org/
http://www.scipy.org
also see:
Very large matrices using Python and NumPy
http://docs.scipy.org/doc/scipy/reference/sparse.html
SciPy has sparse matrices and sparse matrix multiplication (scipy.sparse; NumPy itself only provides dense arrays). It doesn't use much memory, since formats such as CSR store only the nonzero values plus their index arrays, so it will only use the memory required for the datapoints, plus some overhead. And it will almost certainly be blazingly fast compared to a pure Python solution.
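A small sketch of what this looks like with scipy.sparse, assuming the matrices are available as lists of (row, col) positions of the True entries. The 8 x 8 size and the coordinates are illustrative only. The entries are stored as integer ones so that the product counts matching pairs; any nonzero count means True in the boolean product.

```python
import numpy as np
from scipy import sparse

n = 8  # illustrative size; the real matrices are about 80,000 x 80,000

a_rows, a_cols = np.array([0, 3, 5]), np.array([0, 7, 2])
b_rows, b_cols = np.array([0, 7, 7, 6]), np.array([4, 1, 3, 6])

# CSR matrices built from (data, (row, col)) coordinates.
A = sparse.csr_matrix((np.ones(3, dtype=np.int32), (a_rows, a_cols)), shape=(n, n))
B = sparse.csr_matrix((np.ones(4, dtype=np.int32), (b_rows, b_cols)), shape=(n, n))

# Sparse-sparse product; only nonzero entries are computed and stored.
C = A @ B

rows, cols = C.nonzero()
nonzero_positions = sorted(zip(rows.tolist(), cols.tolist()))
```

At under 0.6% density the inputs fit comfortably in RAM in this form, although the density of the product can be much higher than that of the inputs.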
SOLUTION 2
Another answer here suggests storing values as tuples of (x, y), presuming value is False unless it exists, then it's true. Alternate to this is a numeric matrix with (x, y, value) tuples.
REGARDLESS: Multiplying these would be Nasty time-wise: find element one, decide which other array element to multiply by, then search the entire dataset for that specific tuple, and if it exists, multiply and insert the result into the result matrix.
SOLUTION 3 ( PREFERRED vs. Solution 2, IMHO )
I would prefer this because it's simpler / faster.
Represent your sparse matrix with a set of dictionaries. Matrix one is a dict with the element at (x, y) and value v being (with x1,y1, x2,y2, etc.):
matrixDictOne = { 'x1:y1' : v1, 'x2:y2': v2, ... }
matrixDictTwo = { 'x1:y1' : v1, 'x2:y2': v2, ... }
Since a Python dict lookup is O(1) on average (a dict is a hash table, so lookups only degrade with pathological hash collisions), it's fast. This does not require searching the entire second matrix's data for element presence before multiplication. It's easy to write the multiply and easy to understand the representations.
SOLUTION 4 (if you are a glutton for punishment)
Code this solution by using a memory-mapped file of the required size. Initialize a file with null values of the required size. Compute the offsets yourself and write to the appropriate locations in the file as you do the multiplication. Linux has a VMM which will page in and out for you with little overhead or work on your part. This is a solution for very, very large matrices that are NOT SPARSE and thus won't fit in memory.
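A rough sketch of the memory-mapped approach, using np.memmap instead of computing file offsets by hand (that substitution is an assumption on my part, not what the answer literally describes). The tiny 6 x 6 inputs, the block size and the file name are illustrative. The result is written one block of rows at a time, so only one block of the output needs to be resident at once.

```python
import os
import tempfile

import numpy as np

n, block = 6, 2
rng = np.random.default_rng(0)
a = rng.integers(0, 5, size=(n, n)).astype(np.float64)
b = rng.integers(0, 5, size=(n, n)).astype(np.float64)

# File-backed output array; the OS pages it in and out as needed.
path = os.path.join(tempfile.mkdtemp(), "product.dat")
out = np.memmap(path, dtype=np.float64, mode="w+", shape=(n, n))

# Compute and store the product one block of rows at a time.
for start in range(0, n, block):
    out[start:start + block, :] = a[start:start + block, :] @ b
out.flush()

# Reopen read-only to show that the data lives in the file.
reloaded = np.memmap(path, dtype=np.float64, mode="r", shape=(n, n))
```

For matrices that do not fit in memory, a and b would be memory-mapped as well and read block by block in the same way.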
Note this solves the complaint of the below complainer that it won't fit in memory. However, the OP did say sparse, which implies very few actual datapoints spread out in giant arrays, and Numpy / SciPy handle this natively and thus nicely (lots of people at Fermilab use Numpy / SciPy regularly, I'm confident the sparse matrix code is well tested).

PyTables: indexing multiple dimensions of large arrays

I'm analysing some imaging data that consists of large 3-dimensional arrays of pixel intensities with dimensions [frame, x, y]. Since these are usually too big to hold in memory, they reside on the hard disk as PyTables arrays.
What I'd like to be able to do is read out the intensities in an arbitrary subset of pixels across all frames. The natural way to do this seems to be list indexing:
import numpy as np
import tables
tmph5 = tables.open_file('temp.hdf5', 'w')
bigarray = tmph5.create_array('/', 'bigarray', np.random.randn(1000, 200, 100))
roipixels = [[0, 1, 2, 4, 6], [34, 35, 36, 40, 41]]
roidata = bigarray[:, roipixels[0], roipixels[1]]
# IndexError: Only one selection list is allowed
Unfortunately it seems that PyTables currently only supports a single set of list indices. A further problem is that a list index can't contain duplicates - I couldn't simultaneously read pixels [1, 2] and [1, 3], since my list of pixel x-coordinates would contain [1, 1]. I know that I can iterate over rows in the array:
roidata = np.asarray([row[roipixels[0], roipixels[1]] for row in bigarray])
but these iterative reads become quite slow for the large number of frames I'm processing.
Is there a nicer way of doing this? I'm relatively new to PyTables, so if you have any tips on organising datasets in large arrays I'd love to hear them.

For whatever it's worth, I often do the same thing with 3D seismic data stored in hdf format.
The iterative read is slow due to the nested loops. If you only do a single loop (rather than looping over each row), it's quite fast and does exactly what you want. (At least when using h5py; I typically only store table-like data using pytables.)
In most cases, you'll want to iterate over your lists of indices, rather than over each row.
Basically, you want:
roidata = np.vstack([bigarray[:,i,j] for i,j in zip(*roipixels)])
Instead of:
roidata = np.asarray([row[roipixels[0],roipixels[1]] for row in bigarray])
If this is your most common use case, adjusting the chunksize of the stored array will help dramatically. You'll want long, narrow chunks, with the longest length along the first axis, in your case.
(Caveat: I haven't tested this with pytables, but it works perfectly with h5py.)
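One detail worth checking is the orientation of the result: stacking one column per pixel gives an array of shape (pixels, frames), so it needs a transpose to match the (frames, pixels) layout of the row-wise version. In this sketch an in-memory NumPy array stands in for the on-disk dataset, which is an assumption made to keep the example self-contained; the array shape is arbitrary.

```python
import numpy as np

# In-memory stand-in for the on-disk [frame, x, y] array.
bigarray = np.arange(10 * 8 * 50).reshape(10, 8, 50)
roipixels = [[0, 1, 2, 4, 6], [34, 35, 36, 40, 41]]

# One read per pixel, each returning that pixel's values across all frames.
per_pixel = np.vstack([bigarray[:, i, j] for i, j in zip(*roipixels)])

# Transpose to get (frames, pixels), as the row-wise loop would produce.
roidata = per_pixel.T

# NumPy's own fancy indexing, used here only as a reference result.
expected = bigarray[:, roipixels[0], roipixels[1]]
```

Because each pixel is read separately, repeated x or y coordinates are not a problem with this approach.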

how to take a matrix in python?

I want to create a matrix of size 1234*5678, filled with 1 to 5678 in row-major order.

I think you will need to use numpy to hold such a big matrix efficiently, not just for computation. You have about 7e6 items (1234 * 5678 = 7,006,652); at 4 or 8 bytes each that is roughly 28 or 56 MB in pure C already, and several times that in Python without an efficient data structure (a list of rows, each row a list).
Now, concerning your question:
import numpy as np
a = np.empty((1234, 5678), dtype=np.int32)
a[:] = np.arange(1, 5679)
You first create an array of the requested size with an explicit integer type (np.int32 gives 4-byte integers; the old np.int alias was just Python's int and was removed in NumPy 1.24). The 3rd line uses broadcasting so that each row (a[0], a[1], ... a[1233]) is assigned the values of np.arange(1, 5679), an integer array [1, ....., 5678] (np.linspace would produce floats instead). If you want F storage, that is column major:
a = np.empty((1234, 5678), dtype=np.int32, order='F')
...
The matrix a will take only a tiny amount more memory than an array in C, and for computation at least, the indexing capabilities of arrays are much better than those of python lists.
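The size estimate and the memory layout can be checked directly on the array. This is a small sketch that assumes 4-byte integers (np.int32); the variable names are illustrative.

```python
import numpy as np

a = np.empty((1234, 5678), dtype=np.int32)
a[:] = np.arange(1, 5679)          # broadcast 1..5678 into every row

n_items = a.size                   # 1234 * 5678 elements
size_mb = a.nbytes / 1e6           # raw data size in MB

row_major = a.flags["C_CONTIGUOUS"]            # default layout is row-major
col_major = np.asfortranarray(a).flags["F_CONTIGUOUS"]
```

With 4-byte integers the data takes 28,026,608 bytes, about 28 MB; an 8-byte integer type would double that.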
A nitpick: numeric is the name of the old numerical package for python - the recommended name is numpy.

Or just use Numerical Python if you want to do some mathematical stuff on the matrix too (like multiplication, ...). I can't tell you whether they use row-major order for the matrix layout in memory, but it is covered in their documentation.

Here's a forum post that has some code examples of what you are trying to achieve.
