Efficiently store list of matrices - python

I have a large list of images stored as numpy matrices. The images have different sizes, e.g.
import numpy as np
from numpy.random import rand
data = [rand(100,200), rand(1024, 768)]
I am looking for a way to store this list of matrices such that it can be read fast (writing the data can be slow). I tried pickle/numpy.savez, but reading the data was slower than loading the raw images again.
I think hdf5 may be fast, however I cannot figure out how to store this list. Not mandatory, but useful would be a data format that allows appending data, so that the list does not have to be in memory as a whole.
Edit:
Based on the answers so far, I timed some of the suggestions:
data = [rand(1024, 768) for i in np.arange(100)]
def timenp():
    np.savez("test.npz", *data)
    d = np.load('test.npz')
    loaded = [d[f] for f in d]
def timebinary():
    with open("tmp.bin", "wb") as f:
        np.save(f, len(data))
        for img in data:
            np.save(f, img)
    with open("tmp.bin", "rb") as f:
        n = np.load(f)
        loaded = []
        for i in np.arange(n):
            loaded.append(np.load(f))
import h5py
def timeh5py():
    with h5py.File('foo.hdf5', 'w') as f:
        dt = h5py.special_dtype(vlen=np.dtype('float32'))
        dset = f.create_dataset('data', (len(data),), dtype=dt)
        shapes = f.create_dataset('shapes', (len(data), 2), dtype='int32')
        dset[...] = [img.flatten() for img in data]
        shapes[...] = [img.shape for img in data]
    with h5py.File('foo.hdf5', 'r') as f:
        loaded = []
        for (img, shape) in zip(f['data'], f['shapes']):
            loaded.append(np.reshape(img, shape))
python -m cProfile timenp.py
452906 function calls (451141 primitive calls) in 9.256 seconds
python -m cProfile timebinary.py
73085 function calls (71340 primitive calls) in 4.945 seconds
python -m cProfile timeh5py.py
33151 function calls (32568 primitive calls) in 4.384 seconds

Try using the numpy savez function, which comes in both compressed and uncompressed versions.

In [276]: alist=[np.arange(10), np.arange(3), np.arange(100)]
If I save this as np.savez('test', alist), it saves the list as one object. If instead I expand the list with *, then it puts each list element in a separate file within the archive.
In [277]: np.savez('test',*alist)
In [278]: d=np.load('test.npz')
In [279]: list(d.keys())
Out[279]: ['arr_2', 'arr_1', 'arr_0']
In [280]: d['arr_0']
Out[280]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
With np.save (and by extension savez), arrays are stored in their own compact format, which consists of a header block with shape and dtype information, and a data block that is essentially a byte copy of the array's data buffer. So np.save of an array should be as efficient as any other method.
If you give np.save a non-array object, it will use that object's pickle method. But note that the pickle method for an array is the save format I just described, so a pickle of an array should still be efficient.
Keep in mind that npz files are lazy-loaded: np.load returns an archive object, and each array is only read from disk when you access it.
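A minimal sketch of what that lazy loading looks like in practice (the file name is arbitrary):
import numpy as np

np.savez('test.npz', *[np.arange(10), np.arange(3)])

d = np.load('test.npz')   # returns an NpzFile archive object; no array data is read yet
a0 = d['arr_0']           # this access actually reads arr_0 from the file
print(a0)                 # [0 1 2 3 4 5 6 7 8 9]
d.close()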
With h5py, arrays are saved to named datasets. In a sense it is like the savez above - the elements of the list have to have names, whether generated automatically or by your code.
I don't know how h5py speeds compare with a save(z).
h5py can handle arrays that are ragged in one dimension. I've explored that in previous SO questions. Storing multidimensional variable length array with h5py
How to save list of numpy.arrays of different shape with h5py?
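One simple alternative to the vlen + reshape bookkeeping in the question's timeh5py is to give each image its own named dataset. A minimal sketch (file and dataset names are arbitrary, data reuses the list from the question):
import h5py
import numpy as np

data = [np.random.rand(100, 200), np.random.rand(1024, 768)]

# write: one named dataset per image, so the shapes can differ freely
with h5py.File('imgs.h5', 'w') as f:
    for i, img in enumerate(data):
        f.create_dataset('img_%06d' % i, data=img)

# read: each dataset is loaded only when it is sliced
with h5py.File('imgs.h5', 'r') as f:
    loaded = [f[name][...] for name in sorted(f.keys())]
Reopening the file in 'a' mode later lets you add more datasets without holding the whole list in memory, which also covers the appending requirement in the question.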

Related

Write a Scalar to CSV (Numpy)

I'm generating a number of test files iteratively. The process derives a 0-, 1-, or 2-dimensional numpy array, then writes that array to CSV; at least, that's the intent.
Does anyone have a good solution for this?
My code (expectedly) fails when the output is zero-dimensional (scalar):
for key in testfiles:
    tname = key + ".csv"
    np.savetxt(tname, testfiles[key], delimiter=",", newline=';', fmt='%0.15f')
There are a couple of ways to ensure that your input is not a scalar in numpy. For example, you could use np.array:
arr = np.array(testfiles[key], ndmin=1, copy=False)
Another option is np.atleast_1d:
arr = np.atleast_1d(testfiles[key])
Both options will try to produce an array without copying the data. In both cases, pass arr to np.savetxt instead of testfiles[key].
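Put together, a minimal sketch (the scalar value and output file name are just stand-ins for one entry of testfiles):
import numpy as np

scalar_result = np.float64(3.14159)    # hypothetical 0-d output of one test run

arr = np.atleast_1d(scalar_result)     # 0-d scalar -> array of shape (1,); 1-D/2-D pass through unchanged
np.savetxt("scalar_test.csv", arr, delimiter=",", newline=';', fmt='%0.15f')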

Difference between list and NumPy array memory size

I've heard that NumPy arrays are more efficient than Python's built-in lists and that they take up less space in memory. As I understand it, NumPy stores the values next to each other in memory, while a Python list stores 8-byte pointers to the values. However, when I test this in a Jupyter notebook, it turns out that both objects have the same size.
import numpy as np
from sys import getsizeof
array = np.array([_ for _ in range(4)])
getsizeof(array), array
Returns (128, array([0, 1, 2, 3]))
Same as:
l = list([_ for _ in range(4)])
getsizeof(l), l
Gives (128, [0, 1, 2, 3])
Can you provide a clear example of how I can show this in a Jupyter notebook?
getsizeof is not a good measure of memory use, especially with lists. As you note, the list has a buffer of pointers to objects elsewhere in memory. getsizeof reports the size of that buffer, but tells us nothing about the objects.
With
In [66]: list(range(4))
Out[66]: [0, 1, 2, 3]
the list has its basic object storage, plus the buffer with 4 pointers (plus some growth room). The numbers themselves are stored elsewhere. In this case the numbers are small, and already created and cached by the interpreter, so their storage doesn't add anything. But larger numbers (and floats) are created with each use and take up space. Also, a list can contain anything: pointers to other lists, strings, dicts, or whatever.
In [67]: arr = np.array([i for i in range(4)]) # via list
In [68]: arr
Out[68]: array([0, 1, 2, 3])
In [69]: np.array(range(4)) # more direct
Out[69]: array([0, 1, 2, 3])
In [70]: np.arange(4)   # faster
Out[70]: array([0, 1, 2, 3])
arr too has basic object storage with attributes like shape and dtype. It too has a data buffer, but for a numeric dtype like this, that buffer holds actual numeric values (8-byte integers), not pointers to Python integer objects.
In [71]: arr.nbytes
Out[71]: 32
That data buffer only takes 32 bytes - 4*8.
For this small example it's not surprising that getsizeof returns the same thing. The basic object storage is more significant than where the 4 values are stored. It's when working with thousands of values, and with multidimensional arrays, that the memory use becomes significantly different.
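A rough way to see that difference at larger sizes (a sketch; exact byte counts depend on your Python and NumPy versions):
import sys
import numpy as np

n = 100_000
lst = [float(i) for i in range(n)]
arr = np.arange(n, dtype=np.float64)

# list memory = the pointer buffer plus every float object it points to
list_total = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
print(list_total)    # roughly 3.2 MB on CPython
print(arr.nbytes)    # 800000 bytes of data, plus ~100 bytes of object overhead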
But more important is the calculation speed. With an array you can do things like arr + 1 or arr.sum(). These operate in compiled code and are quite fast. Similar list operations have to iterate, at slow Python speeds, through the pointers, fetching values, etc. But doing the same sort of element-by-element iteration over arrays is even slower.
As a general rule, if you start with lists, and do list operations such as append and list comprehensions, it's best to stick with them.
But if you can create the arrays once, or from other arrays, and then use numpy methods, you'll get 10x speed improvements. Arrays are indeed faster, but only if you use them in the right way. They aren't a simple drop-in substitute for lists.
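A quick timing sketch of that difference (numbers will vary by machine):
import timeit
import numpy as np

lst = list(range(100_000))
arr = np.arange(100_000)

# element-wise add: Python loop over the list vs. one vectorized numpy call
t_list = timeit.timeit(lambda: [x + 1 for x in lst], number=100)
t_arr = timeit.timeit(lambda: arr + 1, number=100)
print(t_list, t_arr)   # the array version is typically one to two orders of magnitude faster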
A NumPy array keeps general information about the array (like shape and data type) in the array object header, and all the values are stored in one contiguous block of memory. A list, by contrast, allocates a new memory block for every object and stores pointers to them. So when you iterate over a list you are not iterating directly over the memory that holds the values, you are following pointers, which is not efficient when you are working with large data. Here is an example:
import sys
import numpy as np

random_values_numpy = np.arange(1000)
random_values = list(range(1000))

# NumPy: bytes per element, then the total size of the data buffer
print(random_values_numpy.itemsize)
print(random_values_numpy.size * random_values_numpy.itemsize)

# Python list: the list object itself only holds pointers; the int objects
# it points to are stored separately and add their own overhead
print(sys.getsizeof(random_values))
print(sys.getsizeof(random_values) + sum(sys.getsizeof(v) for v in random_values))

Adding big matrices stored in HDF5 datasets

I have two HDF5 files with an identical structure, each storing a matrix of the same shape. I need to create a third HDF5 file with a matrix representing the element-wise sum of the two matrices mentioned above. Given that the matrices are extremely large (in the GB-TB range), what would be the best way to do it, preferably in parallel? I am using the h5py interface to the HDF5 library. Are there any libraries capable of doing this?
Yes, this is possible. The key is to access slices of the data from file1 and file2, do your element-wise sum, then write that slice of new data to file3. You can do this with h5py or PyTables (aka tables). No other libraries are required. I only have passing knowledge of parallel computing. I know h5py supports an MPI interface through the mpi4py Python package. Details here: h5py docs: Parallel HDF5
Here is a simple example. It creates 2 files with a dataset of random floats, shape=(10,10,10). It then creates a new file with an empty dataset of the same shape. The loop reads a slice of data from file1 and file2, sums them, then writes to the same slice in file3. To test with large data, you can modify the shapes to match your file.
21-Jan-2021 Update:
I added code to get the dataset shapes from file1 and file2, and compare them (to be sure they are equal). If the shapes aren't equal, I exit. If they match, I create the new file, then create a dataset of matching shape. (If you really want to be robust, you could do the same with the dtype.) I also use the value of shape[2] as the slice iterator over the dataset.
import h5py
import numpy as np
import sys

arr = np.random.random(10**3).reshape(10, 10, 10)
with h5py.File('file1.h5', 'w') as h5fw:
    h5fw.create_dataset('data_1', data=arr)

arr = np.random.random(10**3).reshape(10, 10, 10)
with h5py.File('file2.h5', 'w') as h5fw:
    h5fw.create_dataset('data_2', data=arr)

h5fr1 = h5py.File('file1.h5', 'r')
f1shape = h5fr1['data_1'].shape
h5fr2 = h5py.File('file2.h5', 'r')
f2shape = h5fr2['data_2'].shape

if f1shape != f2shape:
    print('Dataset shapes do not match')
    h5fr1.close()
    h5fr2.close()
    sys.exit('Exiting due to error.')
else:
    with h5py.File('file3.h5', 'w') as h5fw:
        ds3 = h5fw.create_dataset('data_3', shape=f1shape, dtype='f')
        for i in range(f1shape[2]):
            arr1_slice = h5fr1['data_1'][:, :, i]
            arr2_slice = h5fr2['data_2'][:, :, i]
            arr3_slice = arr1_slice + arr2_slice
            ds3[:, :, i] = arr3_slice
            # alternately, you can slice and sum in 1 line:
            # ds3[:,:,i] = h5fr1['data_1'][:,:,i] + h5fr2['data_2'][:,:,i]
    print('Done.')
    h5fr1.close()
    h5fr2.close()
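If reading one thin 2D slice at a time turns out to be slow (many small I/O calls), the same idea can read bigger blocks along the first axis instead. A sketch reusing the file and dataset names from the example above; BLOCK is a hypothetical block size you would tune to the memory you can spare:
import h5py

BLOCK = 256  # hypothetical block size

with h5py.File('file1.h5', 'r') as f1, \
     h5py.File('file2.h5', 'r') as f2, \
     h5py.File('file3.h5', 'w') as f3:
    d1, d2 = f1['data_1'], f2['data_2']
    d3 = f3.create_dataset('data_3', shape=d1.shape, dtype=d1.dtype)
    for start in range(0, d1.shape[0], BLOCK):
        stop = min(start + BLOCK, d1.shape[0])
        # read a block of rows from each file, sum in memory, write it out
        d3[start:stop] = d1[start:stop] + d2[start:stop]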

Numpy array error: setting an array element with a sequence

I tried to use nested lists to hold scraped data from HTML, but after about 50,000 list appends I got a memory error. So I decided to change the lists to a numpy array:
SapList = []
ListAll = np.array([])

def eachshop():  # filling each list for each shop's data
    global ListAll
    SapList.append(RowNum)
    SapList.extend([sap])  # here there can be from one to 10 values in one list ["sap1", "sap2", "sap3", ..., "sap10"]
    SapList.extend([[strLink, ProdName], ProdCode, ProdH, NewPrice, OldPrice,
                    [FileName + '#Komp!A1', KompPrice], [FileName + '#Sav!A1', 'Sav']])
    SapList.extend([ss])  # here there can be from zero to 80 sublists with 3 values [["id1", "link", "address"], ..., ["id80", "link", "address"]]
    ListAll = np.append(np.array(SapList))
Then when I do print(ListAll) I get an exception (C:\Python36\scrap.py, line 307, "ListAll = np.append(np.array(SapList))"): setting an array element with a sequence.
Now, to speed things up, I am using pool.map:
def makePool(cP, func, iters):
    try:
        pool = ThreadPool(cP)
        # iterate over the URLs
        pool.map_async(func, enumerate(iters, start=2)).get(99999)
        pool.close()
        pool.join()
    except:
        print('Pool Error')
        raise
    finally:
        pool.terminate()
So how can I use a numpy array in my example to reduce memory usage and speed up the I/O?
It looks like you are trying to make an array from a list that contains a number and lists. Something like:
In [6]: np.array([1, [1,2],[3,4]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-812a9ccb6ca0> in <module>()
----> 1 np.array([1, [1,2],[3,4]])
ValueError: setting an array element with a sequence.
It does work if all the elements are lists:
In [7]: np.array([[1], [1,2],[3,4,5]])
Out[7]: array([list([1]), list([1, 2]), list([3, 4, 5])], dtype=object)
But if they vary in length the result is an object array, not a 2d numeric array. Such an object dtype array is very much like a list of lists, containing pointers to lists elsewhere in memory.
A multidimensional numeric array can use less memory than a list of lists, but it isn't going to help if you need to make the lists first. And it does not help at all if the sublists vary in size.
Oh, and stay away from np.append. It's evil. Plus you misused it (np.append needs at least two arguments).
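For the fixed-width part of the data, the usual pattern is to append to a plain Python list and convert once at the end, instead of calling np.append in a loop (a sketch with placeholder row values):
import numpy as np

rows = []
for i in range(1000):
    rows.append([i, i + 1, i + 2])   # list.append is cheap (amortized O(1))

arr = np.array(rows)                  # one conversion at the end
print(arr.shape)                      # (1000, 3)
np.append, by contrast, copies the whole array on every call, so building an array that way is quadratic in the number of rows.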
As hpaulj pointed out already, numpy arrays will not help here, since you don't have consistent data sizes.
As Spinor8 suggested, dump out data in between instead:
AllList = []
limit = 10000
counter = 0
while not finished:
    if counter >= limit:
        print(AllList)   # or dump this chunk to disk
        AllList = []
        counter = 0
    item = CreateYourList(...)
    AllList.append(item)
    counter += 1
Edit: Since your question is specifically asking about numpy and you even opened a bounty: numpy is not going to help you here, and here is why:
To use numpy efficiently, you have to know the array size at the time of array creation. numpy.append() doesn't actually append in place; it creates a new array on every call, which is a huge overhead with large arrays.
Numpy arrays work best if all items have the same number of elements. Specifically, you can think of a numpy array like a matrix: all rows have the same number of columns.
You could create a numpy array sized for the largest element in your data stream, but that would mean allocating memory you don't need (array elements that will never be filled). This will clearly not solve your memory problem.
So IMHO, your only way to solve this is to break your stream into chunks that your memory can handle, and stitch them together afterwards. Maybe write them to a (temporary) file and append to it?
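A minimal sketch of the temporary-file idea, writing each chunk with np.save to one open file and reading the chunks back one at a time (chunk shape, count, and file name are placeholders):
import numpy as np

n_chunks = 5

with open("chunks.bin", "wb") as f:
    for _ in range(n_chunks):               # stand-in for your scraping loop
        chunk = np.random.rand(10000, 3)    # stand-in for one batch of fixed-width rows
        np.save(f, chunk)                   # appends one array to the open file

loaded = []
with open("chunks.bin", "rb") as f:
    for _ in range(n_chunks):
        loaded.append(np.load(f))           # reads one array per call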

How to efficiently construct a numpy array from a large set of data?

If I have a huge list of lists in memory and I wish to convert it into an array, does the naive approach cause Python to make a copy of all the data, taking twice the space in memory? Should I instead convert the list of lists vector by vector, popping elements as I go?
# for instance
list_of_lists = [[...], ..., [...]]
arr = np.array(list_of_lists)
Edit:
Is it better to create an empty array of a known size and then populate it incrementally, thus avoiding the list_of_lists object entirely? Could this be accomplished by something as simple as some_array[i] = some_list_of_float_values?
I'm just putting this here as it's a bit long for a comment.
Have you read the numpy documentation for array?
numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
"""
...
copy : bool, optional
If true (default), then the object is copied. Otherwise, a copy will
only be made if __array__ returns a copy, if obj is a nested sequence,
or if a copy is needed to satisfy any of the other requirements (dtype,
order, etc.).
...
"""
When you say you don't want to copy the data of the original array when creating the numpy array, what data structure are you hoping to end up with?
A lot of the speed-up you get from using numpy comes from the fact that the C arrays it creates are contiguous in memory. A Python list is just an array of pointers to objects, so every access has to go and find the object, which isn't the case in numpy, as it's not written in Python.
If you just have the numpy array reference the Python lists in your 2D structure, you'll lose the performance gains.
If you do np.array(my_2D_python_array, copy=False), I don't know what it will actually produce, but you could easily test it yourself: look at the shape of the array and see what kind of objects it houses.
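A quick way to run that test (a sketch using np.asarray, which only copies when it has to; note that on NumPy 2.0 and later an explicit copy=False raises an error instead of silently copying):
import numpy as np

nested = [[1.0, 2.0], [3.0, 4.0]]

a = np.asarray(nested)
print(a.shape, a.dtype)   # (2, 2) float64, a real 2-D numeric array
a[0, 0] = 99.0
print(nested[0][0])       # still 1.0: the data was copied, the original lists are untouched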
If you want the numpy array to be contiguous, though, at some point you're going to have to allocate all of the memory it needs (and if it's as large as you're suggesting, a contiguous section that big might be hard to find).
Sorry, that was pretty rambling, just a comment. How big are the actual arrays you're looking at?
Here's a plot of the cpu usage and memory usage of a small sample program:
# Make a large python 2D array
N, M = 10000, 18750
print("%i x %i = %i doubles = %f GB" % (N, M, N * M, N * M * 8 / 10**9))

# grab pid to monitor memory and cpu usage
import os
pid = os.getpid()
os.system("python moniter.py -p " + str(pid) + " &")

print("building python matrix")
large_2d_array = [[n + m * M for n in range(N)] for m in range(M)]

import numpy
from datetime import datetime

print(datetime.now(), "creating numpy array with copy")
np1 = numpy.array(large_2d_array, copy=True)
print(datetime.now(), "deleting array")
del np1

print(datetime.now(), "creating numpy array without copy")
np1 = numpy.array(large_2d_array, copy=False)
print(datetime.now(), "deleting array")
del np1
1, 2, and 3 are the points where each of the matrices finishes being created. Note that the native Python list of lists takes up much more memory than the numpy arrays: Python objects each have their own overhead, and the lists are lists of objects. For the numpy array this is not the case, so it is considerably smaller.
Also note that using copy=False on the Python object has no effect: new data is always created. You could get around this by creating a numpy array of Python objects (using dtype=object), but I wouldn't advise it.
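Regarding the edit in the question: yes, assigning a list of floats into a row of a preallocated array works, as long as every row has the same length. A minimal sketch (sizes and values are placeholders; the names echo the question):
import numpy as np

n_rows, n_cols = 1000, 3

some_array = np.empty((n_rows, n_cols), dtype=np.float64)
for i in range(n_rows):
    some_list_of_float_values = [0.1 * i, 0.2 * i, 0.3 * i]   # stand-in for one row of data
    some_array[i] = some_list_of_float_values                 # row assignment, no list_of_lists needed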
