Streaming multiple numpy arrays to a file - python

This differs from "Write multiple numpy arrays to file" in that I need to be able to stream content, rather than writing it all at once.
I need to write multiple compressed numpy arrays in binary to a file. I cannot store all the arrays in memory before writing, so it is more like streaming numpy arrays to a file.
This currently works fine as text:
file = open("some file")
while doing stuff:
    file.writelines(somearray + "\n")
where somearray is a new instance every loop iteration.
However, this does not work if I try to write the arrays as binary.
The arrays are created at 30 Hz and grow too big to keep in memory. They also cannot each be stored in their own single-array files, because that would be wasteful and cause a huge mess.
So I would like only one file per session instead of 10k files per session.

One option might be to use pickle to save the arrays to a file opened in append-binary mode:
import numpy as np
import pickle

arrays = [np.arange(n**2).reshape((n, n)) for n in range(1, 11)]

with open('test.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

new_arrays = []
with open('test.file', 'rb') as f:
    while True:
        try:
            new_arrays.append(pickle.load(f))
        except EOFError:
            break

assert all((new_array == array).all() for new_array, array in zip(new_arrays, arrays))
This might not be the fastest approach, but it should be fast enough. It might seem like it would take up more space, but compare these:
x = 300
y = 300
arrays = [np.random.randn(x, y) for x in range(30)]

with open('test2.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

with open('test3.file', 'ab') as f:
    for array in arrays:
        f.write(array.tobytes())

with open('test4.file', 'ab') as f:
    for array in arrays:
        np.save(f, array)
You'll find the file sizes are 1,025 KB, 1,020 KB, and 1,022 KB respectively.
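Since the point is streaming, the read side can also be written as a lazy generator so that only one array is ever in memory at a time. A minimal sketch (the helper name iter_pickled_arrays and the filename are just for illustration):
import pickle

def iter_pickled_arrays(path):
    # Yield arrays one at a time from a file of appended pickles.
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

# Usage: process arrays without holding them all in memory.
for arr in iter_pickled_arrays('test.file'):
    print(arr.shape, arr.dtype)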

An NPZ file is just a zip archive, so you could save each array to a temporary NPY file, add that NPY file to the zip archive, and then delete the temporary file.
For example,
import os
import zipfile

import numpy as np

# File that will hold all the arrays.
filename = 'foo.npz'

with zipfile.ZipFile(filename, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(10):
        # `a` is the array to be written to the file in this iteration.
        a = np.random.randint(0, 10, size=20)

        # Name for the temporary file to which `a` is written. The root of this
        # filename is the name that will be assigned to the array in the npz file.
        # I've used 'arr_{}' (e.g. 'arr_0', 'arr_1', ...), similar to how `np.savez`
        # treats positional arguments.
        tmpfilename = "arr_{}.npy".format(i)

        # Save `a` to a npy file.
        np.save(tmpfilename, a)

        # Add the file to the zip archive.
        zf.write(tmpfilename)

        # Delete the npy file.
        os.remove(tmpfilename)
Here's an example where that script is run, and then the data is read back using np.load:
In [1]: !ls
add_array_to_zip.py
In [2]: run add_array_to_zip.py
In [3]: !ls
add_array_to_zip.py foo.npz
In [4]: foo = np.load('foo.npz')
In [5]: foo.files
Out[5]: ['arr_0', 'arr_1', 'arr_2', 'arr_3', 'arr_4', 'arr_5', 'arr_6', 'arr_7', 'arr_8', 'arr_9']
In [6]: foo['arr_0']
Out[6]: array([0, 9, 3, 7, 2, 2, 7, 2, 0, 5, 8, 1, 1, 0, 4, 2, 5, 1, 8, 2])
You'll have to test this on your system to see if it can keep up with your array generation process.
Another alternative is to use something like HDF5, with either h5py or pytables.
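For completeness, here is a minimal sketch of what the HDF5 route could look like with h5py, assuming every array has the same shape and dtype; the file name, dataset name, and shape are arbitrary placeholders:
import h5py
import numpy as np

shape = (300, 300)                                # assumed per-array shape
with h5py.File('stream.h5', 'w') as f:
    dset = f.create_dataset('arrays', shape=(0,) + shape,
                            maxshape=(None,) + shape,
                            chunks=(1,) + shape, compression='gzip')
    for _ in range(10):                           # stand-in for the 30 Hz acquisition loop
        a = np.random.randn(*shape)
        dset.resize(dset.shape[0] + 1, axis=0)    # grow the dataset by one array
        dset[-1] = a                              # append the new array
Each append writes only the new slab to disk, so memory use stays bounded by a single array.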

Related

Saving Python Results as .txt File

I have this code (unique_set = np.random.choice([0, 1], (10000, 10, 10, 10))) that generates 10,000 3D binary matrices, and I'm attempting to save the result as a .txt file. The other similar questions I checked were either trying to write a print statement to a file or were noticeably different. I tried many of the solutions, like the one below, but none of them worked.
sys.stdout = open("test.txt", "w")
print(unique_set)
sys.stdout.close()
Try this one:
import numpy as np

file = open('D:\\yourpath\\filename.txt', 'w')
unique_set = np.random.choice([0, 1], (10000, 10, 10, 10))
file.write('%s\n' % unique_set)
file.close()
Not knowing what the format of your output file should look like, here is one possibility:
np.savetxt("test.txt", unique_set.flatten(), delimiter=",")
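Since savetxt writes the flattened values, the shape has to be restored on load; a small sketch, assuming the original (10000, 10, 10, 10) shape:
import numpy as np

# Load the flattened values and restore the original 4D shape.
unique_set = np.loadtxt("test.txt", delimiter=",").reshape(10000, 10, 10, 10)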
You can store it as a JSON text file, which preserves the fact that it is a 4D array (see: storing Numpy N dimensional arrays):
import json

with open('test.txt', 'w') as f:
    json.dump(unique_set.tolist(), f)
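Reading the JSON back and rebuilding the array is the reverse operation; a minimal sketch, assuming the file written above:
import json

import numpy as np

# Rebuild the 4D array from the JSON text file.
with open('test.txt') as f:
    unique_set = np.array(json.load(f))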

H5Py and storage

I am writing some code which needs to save a very large numpy array to memory. The numpy array is so large in fact that I cannot load it all into memory at once. But I can calculate the array in chunks. I.e. my code looks something like:
for i in np.arange(numberOfChunks):
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = # ... do some calculation
As I can't load myArray into memory all at once, I want to save it to a file one "chunk" at a time. i.e. I want to do something like this:
for i in np.arange(numberOfChunks):
    myArrayChunk = # ... do some calculation to obtain chunk
    saveToFile(myArrayChunk, indicesInFile=[(i*chunkSize):(i*(chunkSize+1)),:,:], filename)
I understand this can be done with h5py but I am a little confused how to do this. My current understanding is that I can do this:
import h5py

# Make the file
h5py_file = h5py.File(filename, "a")

# Tell it we are going to store a dataset
myArray = h5py_file.create_dataset("myArray", myArrayDimensions, compression="gzip")

for i in np.arange(numberOfChunks):
    myArrayChunk = # ... do some calculation to obtain chunk
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
But this is where I become a little confused. I have read that if you index an h5py dataset like I did when I wrote myArray[(i*chunkSize):(i*(chunkSize+1)),:,:], then that part of myArray has now been read into memory. So surely, by the end of my loop above, have I not still got the whole of myArray in memory? How has this saved memory?
Similarly, later on, I would like to read in my file back in one chunk at a time, doing further calculation. i.e. I would like to do something like:
import h5py

# Read in the file
h5py_file = h5py.File(filename, "a")

# Read in myArray
myArray = h5py_file['myArray']

for i in np.arange(numberOfChunks):
    # Read in chunk
    myArrayChunk = myArray[(i*chunkSize):(i*(chunkSize+1)),:,:]
    # ... Do some calculation on myArrayChunk
But by the end of this loop, is the whole of myArray now in memory? I am a little confused about when myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] is in memory and when it isn't. Could someone please explain this?
You have the basic idea. Take care when saying "save to memory": NumPy arrays live in memory (RAM), while HDF5 data is saved on disk (not to memory/RAM!) and then accessed from there (how much memory is used depends on how you access it). In the first step you are creating and writing data in chunks to disk. In the second step you are accessing data from disk in chunks. A working example is provided at the end.
When reading data with h5py, there are two ways to read the data:
This returns a NumPy array:
myArrayNP = myArray[:,:,:]
This returns a h5py dataset object that operates like a NumPy array:
myArrayDS = myArray
The difference: h5py dataset objects are not read into memory all at once. You can then slice them as needed. Continuing from above, this is a valid operation to get a subset of the data:
myArrayChunkNP = myArrayDS[(i*chunkSize):((i+1)*chunkSize),:,:]
My example also corrects a small error in your chunk-index arithmetic.
You had:
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
You want:
myArray[(i*chunkSize):((i+1)*chunkSize),:,:] = myArrayChunk
Working Example (writes and reads):
import h5py
import numpy as np

# Make the file
with h5py.File("SO_61173314.h5", "w") as h5w:
    numberOfChunks = 3
    chunkSize = 4
    print('WRITING %d chunks w/ chunkSize=%d' % (numberOfChunks, chunkSize))
    # Write dataset to disk
    h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize, 2, 2), compression="gzip")
    for i in range(numberOfChunks):
        h5ArrayChunk = np.random.random(chunkSize*2*2).reshape(chunkSize, 2, 2)
        print(h5ArrayChunk)
        h5Array[(i*chunkSize):((i+1)*chunkSize),:,:] = h5ArrayChunk

with h5py.File("SO_61173314.h5", "r") as h5r:
    print('\nREADING %d chunks w/ chunkSize=%d\n' % (numberOfChunks, chunkSize))
    # Access the myArray dataset - Note: this is NOT a NumPy array
    myArray = h5r['myArray']
    for i in range(numberOfChunks):
        # Read a chunk into memory (as a NumPy array)
        myArrayChunk = myArray[(i*chunkSize):((i+1)*chunkSize),:,:]
        # ... Do some calculation on myArrayChunk
        print(myArrayChunk)
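One optional refinement (my assumption, not something the original answer requires): you can pass an explicit chunks= shape to create_dataset so the on-disk HDF5 chunking matches the slabs you write and read:
import h5py

numberOfChunks, chunkSize = 3, 4
with h5py.File("SO_61173314.h5", "w") as h5w:
    # Explicit chunk shape matching the (chunkSize, 2, 2) slabs written in the loop above.
    h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize, 2, 2),
                                 chunks=(chunkSize, 2, 2), compression="gzip")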

Numpy load result file saved in append mode

I have one big file saved using numpy in append mode; it contains maybe 5000 arrays, each with a shape like [1, 224, 224, 3], written like this:
filepath = 'hello'
for some loop:
    ...
    with open(filepath, 'ab') as f:
        np.save(f, ndarray)
I need to load the data in the file, maybe all the arrays, or maybe in some generator-like mode, reading the first 100, then the next 100, and so on. Is there any method to do this properly? Right now I only know that if I use np.load, I can only get one array at a time, and I don't know how to read, say, arrays 100 to 199.
loading arrays saved using numpy.save in append mode
This question talks about something similar, but it does not seem to be what I want.
One solution, although it is ugly and can only get all the arrays in the file (thus risking an out-of-memory error), is the following:
a = []
with open(filepath, 'rb') as f:
    while True:
        try:
            a.append(np.load(f))
        except:
            break
np.stack(a)
This is more of a hack (given your situation).
Anyway, here is the code that created the file with np.save in append mode:
import numpy as np

numpy_arrays = [np.array([1, 2, 3]), np.array([0, 9])]
print(numpy_arrays[0], numpy_arrays[1])
print(type(numpy_arrays[0]), type(numpy_arrays[1]))

for numpy_array in numpy_arrays:
    with open("./my-numpy-arrays.bin", 'ab') as f:
        np.save(f, numpy_array)
[1 2 3] [0 9]
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
... and here is the code that catches the end-of-file error (and any other errors) while looping through:
with open("./my-numpy-arrays.bin", 'rb') as f:
    while True:
        try:
            numpy_array = np.load(f)
            print(numpy_array)
        except:
            break
[1 2 3]
[0 9]
Not very pretty but ... it works.
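To read only arrays 100 to 199 (as asked) without stacking everything, one option is a lazy generator combined with itertools.islice; a sketch, with the helper name being illustrative and filepath taken from the question:
from itertools import islice

import numpy as np

def iter_saved_arrays(path):
    # Yield arrays one at a time from a file of appended np.save calls.
    with open(path, 'rb') as f:
        while True:
            try:
                yield np.load(f)
            except (EOFError, ValueError):  # end of file; the exact exception depends on the NumPy version
                return

# Usage: skip the first 100 arrays and take the next 100.
batch = list(islice(iter_saved_arrays(filepath), 100, 200))
The skipped arrays are still deserialized one by one, but never more than one is held in memory at a time.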

Saving a dictionary of numpy arrays in human-readable format

This is not a duplicate question. I looked around a lot and found this question, but the savez and pickle utilities render the file unreadable by a human. I want to save it in a .txt file which can be loaded back into a Python script. So I wanted to know whether there are some utilities in Python which can facilitate this task and keep the written file readable by a human.
The dictionary of numpy arrays contains 2D arrays.
EDIT:
Following Craig's answer, I tried the following:
import numpy as np

W = np.arange(10).reshape(2, 5)
b = np.arange(12).reshape(3, 4)
d = {'W': W, 'b': b}

with open('out.txt', 'w') as outfile:
    outfile.write(repr(d))

f = open('out.txt', 'r')
d = eval(f.readline())
print(d)
This gave the following error: SyntaxError: unexpected EOF while parsing.
But out.txt did contain the dictionary as expected. How can I load it correctly?
EDIT 2:
Ran into a problem: Craig's answer truncates the array if it is large. out.txt shows the first few elements, replaces the middle elements with ..., and shows the last few elements.
Convert the dict to a string using repr() and write that to the text file.
import numpy as np

d = {'a': np.zeros(10), 'b': np.ones(10)}

with open('out.txt', 'w') as outfile:
    outfile.write(repr(d))
You can read it back in and convert to a dictionary with eval(). Note that repr() of a 2D array spans multiple lines, so you must read the whole file with f.read() rather than a single f.readline() (which is why the readline() attempt hit an unexpected EOF):
import numpy as np
f = open('out.txt', 'r')
data = f.read()
data = data.replace('array', 'np.array')
d = eval(data)
Or, you can directly import array from numpy:
from numpy import array
f = open('out.txt', 'r')
data = f.read()
d = eval(data)
H/T: How can a string representation of a NumPy array be converted to a NumPy array?
Handling large arrays
By default, numpy summarizes arrays longer than 1000 elements. You can change this behavior by calling numpy.set_printoptions(threshold=S) where S is larger than the size of the arrays. For example:
import numpy as np

W = np.arange(10).reshape(2, 5)
b = np.arange(12).reshape(3, 4)
d = {'W': W, 'b': b}

largest = max(np.prod(a.shape) for a in d.values())  # get the size of the largest array
np.set_printoptions(threshold=largest)  # set threshold to largest to avoid summarizing

with open('out.txt', 'w') as outfile:
    outfile.write(repr(d))

np.set_printoptions(threshold=1000)  # recommended, but not necessary
H/T: Ellipses when converting list of numpy arrays to string in python 3

How to load one line at a time from a pickle file?

I have a large dataset: a 20,000 x 40,000 numpy array. I have saved it as a pickle file.
Instead of reading this huge dataset into memory, I'd like to only read a few (say 100) rows of it at a time, for use as a minibatch.
How can I read only a few randomly-chosen (without replacement) lines from a pickle file?
You can write pickles incrementally to a file, which allows you to load them
incrementally as well.
Take the following example. Here, we iterate over the items of a list, and
pickle each one in turn.
>>> import cPickle
>>> myData = [1, 2, 3]
>>> f = open('mydata.pkl', 'wb')
>>> pickler = cPickle.Pickler(f)
>>> for e in myData:
... pickler.dump(e)
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
>>> f.close()
Now we can do the same process in reverse and load each object as needed. For
the purpose of example, let's say that we just want the first item and don't
want to iterate over the entire file.
>>> f = open('mydata.pkl', 'rb')
>>> unpickler = cPickle.Unpickler(f)
>>> unpickler.load()
1
At this point, the file stream has only advanced as far as the first
object. The remaining objects weren't loaded, which is exactly the behavior you
want. For proof, you can try reading the rest of the file and see the rest is
still sitting there.
>>> f.read()
'I2\n.I3\n.'
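Building on this, a small helper (the name load_nth is mine, purely illustrative) can return just the n-th pickled object by loading and discarding the ones before it:
import pickle

def load_nth(path, n):
    # Return the n-th pickled object (0-based). Earlier objects are still
    # deserialized and discarded, but only one is ever held in memory.
    with open(path, 'rb') as f:
        for _ in range(n):
            pickle.load(f)
        return pickle.load(f)

# Usage, assuming 'mydata.pkl' from the session above:
print(load_nth('mydata.pkl', 1))    # -> 2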
Since you cannot rely on the internal workings of pickle, you need to use another storage method. The script below uses the tobytes() function to save the data line-wise in a raw file.
Since the length of each line is known, its offset in the file can be computed and accessed via seek() and read(). After that, the data is converted back to an array with the frombuffer() function.
The big disclaimer, however, is that the size of the array is not saved (this could be added as well, but requires some more complications) and that this method might not be as portable as a pickled array.
As #PadraicCunningham pointed out in his comment, a memmap is likely to be an alternative and elegant solution (see the sketch after the code below).
Remark on performance: after reading the comments I did a short benchmark. On my machine (16 GB RAM, encrypted SSD) I was able to do 40000 random line reads in 24 seconds (with a 20000x40000 matrix of course, not the 10x10 from the example).
from __future__ import print_function

import random

import numpy


def dumparray(a, path):
    lines, _ = a.shape
    with open(path, 'wb') as fd:
        for i in range(lines):
            fd.write(a[i, ...].tobytes())


class RandomLineAccess(object):
    def __init__(self, path, cols, dtype):
        self.dtype = dtype
        self.fd = open(path, 'rb')
        self.line_length = cols*dtype.itemsize

    def read_line(self, line):
        offset = line*self.line_length
        self.fd.seek(offset)
        data = self.fd.read(self.line_length)
        return numpy.frombuffer(data, self.dtype)

    def close(self):
        self.fd.close()


def main():
    lines = 10
    cols = 10
    path = '/tmp/array'
    a = numpy.zeros((lines, cols))
    dtype = a.dtype

    for i in range(lines):
        # add some data to distinguish lines
        numpy.ndarray.fill(a[i, ...], i)
    dumparray(a, path)

    rla = RandomLineAccess(path, cols, dtype)

    line_indices = list(range(lines))
    for _ in range(20):
        line_index = random.choice(line_indices)
        print(line_index, rla.read_line(line_index))


if __name__ == '__main__':
    main()
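As a sketch of the memmap alternative mentioned above, assuming the same raw file layout that dumparray writes (contiguous float64 rows):
import numpy as np

lines, cols = 10, 10
# Map the raw file written by dumparray; rows are only read from disk when indexed.
mm = np.memmap('/tmp/array', dtype=np.float64, mode='r', shape=(lines, cols))

print(mm[3])  # pulls just that row's bytes from disk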
Thanks everyone. I ended up finding a workaround (a machine with more RAM so I could actually load the dataset into memory).
