I have a large dataset: 20,000 x 40,000 as a numpy array. I have saved it as a pickle file.
Instead of reading this huge dataset into memory, I'd like to only read a few (say 100) rows of it at a time, for use as a minibatch.
How can I read only a few randomly-chosen (without replacement) lines from a pickle file?
You can write pickles incrementally to a file, which allows you to load them
incrementally as well.
Take the following example. Here, we iterate over the items of a list, and
pickle each one in turn.
>>> import cPickle
>>> myData = [1, 2, 3]
>>> f = open('mydata.pkl', 'wb')
>>> pickler = cPickle.Pickler(f)
>>> for e in myData:
...     pickler.dump(e)
...
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
>>> f.close()
Now we can do the same process in reverse and load each object as needed. For
the purpose of example, let's say that we just want the first item and don't
want to iterate over the entire file.
>>> f = open('mydata.pkl', 'rb')
>>> unpickler = cPickle.Unpickler(f)
>>> unpickler.load()
1
At this point, the file stream has only advanced as far as the first
object. The remaining objects weren't loaded, which is exactly the behavior you
want. For proof, you can try reading the rest of the file and see the rest is
still sitting there.
>>> f.read()
'I2\n.I3\n.'
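Sequential loading alone does not give random access. One way to extend the idea, sketched here for Python 3 under the assumption that each row is pickled separately, is to record the file offset of every pickle while writing and then seek() to the chosen offsets when reading. The rows list and the file name are placeholders, not part of the original question.

```python
import os
import pickle
import random
import tempfile

# Stand-in for the real rows; each element is pickled on its own.
rows = [[i, i * 10] for i in range(1000)]

path = os.path.join(tempfile.mkdtemp(), "rows.pkl")

# Write one pickle per row and remember where each one starts.
offsets = []
with open(path, "wb") as f:
    for row in rows:
        offsets.append(f.tell())
        pickle.dump(row, f, protocol=pickle.HIGHEST_PROTOCOL)

# Choose 100 distinct row indices (sampling without replacement).
chosen = random.sample(range(len(rows)), 100)

# Load only the chosen rows by jumping straight to their offsets.
batch = []
with open(path, "rb") as f:
    for i in chosen:
        f.seek(offsets[i])
        batch.append(pickle.load(f))
```

The offsets list has to be kept somewhere (for example in a second small file), since it cannot be recovered without scanning the whole pickle file.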
Since pickle does not expose its internal layout, random access needs a different storage method. The script below uses the tobytes() method to save the data row by row in a raw file.
Since the length of each line is known, its offset in the file can be computed and accessed via seek() and read(). The bytes are then converted back to an array with the frombuffer() function.
The big disclaimer, however, is that the shape of the array is not saved (this could be added, at the cost of some extra complexity) and that this method might not be as portable as a pickled array.
As @PadraicCunningham pointed out in his comment, a memmap is likely to be an elegant alternative.
Remark on performance: After reading the comments I did a short benchmark. On my machine (16GB RAM, encrypted SSD) I was able to do 40000 random line reads in 24 seconds (with a 20000x40000 matrix of course, not the 10x10 from the example).
from __future__ import print_function
import numpy
import random

def dumparray(a, path):
    lines, _ = a.shape
    with open(path, 'wb') as fd:
        for i in range(lines):
            fd.write(a[i,...].tobytes())

class RandomLineAccess(object):
    def __init__(self, path, cols, dtype):
        self.dtype = dtype
        self.fd = open(path, 'rb')
        self.line_length = cols*dtype.itemsize

    def read_line(self, line):
        offset = line*self.line_length
        self.fd.seek(offset)
        data = self.fd.read(self.line_length)
        return numpy.frombuffer(data, self.dtype)

    def close(self):
        self.fd.close()

def main():
    lines = 10
    cols = 10
    path = '/tmp/array'
    a = numpy.zeros((lines, cols))
    dtype = a.dtype
    for i in range(lines):
        # add some data to distinguish lines
        numpy.ndarray.fill(a[i,...], i)
    dumparray(a, path)
    rla = RandomLineAccess(path, cols, dtype)
    line_indices = list(range(lines))
    for _ in range(20):
        line_index = random.choice(line_indices)
        print(line_index, rla.read_line(line_index))

if __name__ == '__main__':
    main()
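The memmap alternative mentioned above could look roughly like this. It is a minimal sketch, assuming the array is stored as raw float64 with a known shape; the small shape and the temporary path are placeholders for the real 20,000 x 40,000 file.

```python
import os
import tempfile

import numpy as np

rows, cols = 200, 50  # small stand-in for the real 20000 x 40000 array
path = os.path.join(tempfile.mkdtemp(), "array.dat")

# Create the file once; row i is filled with the value i so rows are distinguishable.
writer = np.memmap(path, dtype=np.float64, mode="w+", shape=(rows, cols))
for i in range(rows):
    writer[i, :] = i
writer.flush()
del writer

# Re-open read-only. Nothing is loaded until rows are actually indexed.
data = np.memmap(path, dtype=np.float64, mode="r", shape=(rows, cols))

# Draw 100 distinct row indices; sorting them keeps the disk access mostly sequential.
rng = np.random.default_rng()
indices = np.sort(rng.choice(rows, size=100, replace=False))

# Fancy indexing copies just the selected rows into an in-memory array.
batch = np.array(data[indices])
```

As with the raw-file script, the dtype and shape are not stored in the file and must be supplied when re-opening it.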
Thanks everyone. I ended up finding a workaround (a machine with more RAM so I could actually load the dataset into memory).
Related
I am writing some code which needs to save a very large numpy array to memory. The numpy array is so large in fact that I cannot load it all into memory at once. But I can calculate the array in chunks. I.e. my code looks something like:
for i in np.arange(numberOfChunks):
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = #... do some calculation
As I can't load myArray into memory all at once, I want to save it to a file one "chunk" at a time. i.e. I want to do something like this:
for i in np.arange(numberOfChunks):
    myArrayChunk = #... do some calculation to obtain chunk
    saveToFile(myArrayChunk, indicesInFile=[(i*chunkSize):(i*(chunkSize+1)),:,:], filename)
I understand this can be done with h5py but I am a little confused how to do this. My current understanding is that I can do this:
import h5py
# Make the file
h5py_file = h5py.File(filename, "a")
# Tell it we are going to store a dataset
myArray = h5py_file.create_dataset("myArray", myArrayDimensions, compression="gzip")
for i in np.arange(numberOfChunks):
    myArrayChunk = #... do some calculation to obtain chunk
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
But this is where I become a little confused. I have read that if you index a h5py datatype like I did when I wrote myArray[(i*chunkSize):(i*(chunkSize+1)),:,:], then this part of myArray has now been read into memory. So surely, by the end of my loop above, have I not still got the whole of myArray in memory now? How has this saved my memory?
Similarly, later on, I would like to read in my file back in one chunk at a time, doing further calculation. i.e. I would like to do something like:
import h5py
# Read in the file
h5py_file = h5py.File(filename, "a")
# Read in myArray
myArray = h5py_file['myArray']
for i in np.arange(numberOfChunks):
    # Read in chunk
    myArrayChunk = myArray[(i*chunkSize):(i*(chunkSize+1)),:,:]
    # ... Do some calculation on myArrayChunk
But by the end of this loop is the whole of myArray now in memory? I am a little confused by when myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] is in memory and when it isn't. Please could someone explain this.
You have the basic idea. Take care when saying "save to memory". NumPy arrays are saved in memory (RAM). HDF5 data is saved on disk (not to memory/RAM!), then accessed (memory used depends on how you access). In the first step you are creating and writing data in chunks to the disk. In the second step you are accessing data from disk in chunks. Working example provided at the end.
When reading data with h5py, there are two ways to access it:
This returns a NumPy array:
myArrayNP = myArray[:,:,:]
This returns a h5py dataset object that operates like a NumPy array:
myArrayDS = myArray
The difference: h5py dataset objects are not read into memory all at once. You can then slice them as needed. Continuing from above, this is a valid operation to get a subset of the data:
myArrayChunkNP = myArrayDS[(i*chunkSize):((i+1)*chunkSize),:,:]
My example also corrects 1 small error in your chunksize increment equation.
You had:
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
You want:
myArray[(i*chunkSize):((i+1)*chunkSize),:,:] = myArrayChunk
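To see why the end index matters, the two expressions can be compared in plain Python. This is only a small illustration; the snake_case names are mine, and the values match the working example's chunk size of 4 and 3 chunks.

```python
chunk_size = 4
number_of_chunks = 3

# End index from the question: i * (chunk_size + 1)
wrong = [(i * chunk_size, i * (chunk_size + 1)) for i in range(number_of_chunks)]

# Corrected end index: (i + 1) * chunk_size
right = [(i * chunk_size, (i + 1) * chunk_size) for i in range(number_of_chunks)]

print(wrong)  # [(0, 0), (4, 5), (8, 10)]
print(right)  # [(0, 4), (4, 8), (8, 12)]
```

With the original expression the first slice is empty and the later ones cover only part of each chunk, while the corrected version tiles the first axis without gaps or overlap.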
Working Example (writes and reads):
import h5py
import numpy as np
# Make the file
with h5py.File("SO_61173314.h5", "w") as h5w:
    numberOfChunks = 3
    chunkSize = 4
    print( 'WRITING %d chunks w/ chunkSize=%d ' % (numberOfChunks,chunkSize) )
    # Write dataset to disk
    h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize,2,2), compression="gzip")
    for i in range(numberOfChunks):
        h5ArrayChunk = np.random.random(chunkSize*2*2).reshape(chunkSize,2,2)
        print (h5ArrayChunk)
        h5Array[(i*chunkSize):((i+1)*chunkSize),:,:] = h5ArrayChunk

with h5py.File("SO_61173314.h5", "r") as h5r:
    print( '\nREADING %d chunks w/ chunkSize=%d\n' % (numberOfChunks,chunkSize) )
    # Access myArray dataset - Note: This is NOT a NumPy array
    myArray = h5r['myArray']
    for i in range(numberOfChunks):
        # Read a chunk into memory (as a NumPy array)
        myArrayChunk = myArray[(i*chunkSize):((i+1)*chunkSize),:,:]
        # ... Do some calculation on myArrayChunk
        print (myArrayChunk)
I have a large input file which consists of data frames (a data series (complex64), with an identifying header in each frame). It is larger than my available memory. The headers repeat, but are randomly ordered, so for example the input file could look like:
<FRAME header={0}, data={**first** 500 numbers...}>,
<FRAME header={18}, data={first 500 numbers...}>,
<FRAME header={4}, data={first 500 numbers...}>,
<FRAME header={0}, data={**next** 500 numbers...}>
...
I want to order the data into a new file that is a numpy array of shape (len(headers), len(data_series)). It has to build the output file as it reads the frames, because I can't fit it all in memory.
I've looked at numpy.savetxt and the python csv package but for disk size, precision, and speed reasons I would prefer for the output file to be binary. numpy.save is good except that I can't figure out how to make it append to an unknown array size.
I have to work in Python2.7 because of some dependencies needed to read these frames. What I have done so far is made a function able to write all of the frames with a matching header to a single binary file:
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
with open("singleFrameHeader", 'ab') as f:
    current_data = input_data.readFrame() # This loads the next frame in the file
    if current_data.header == 0:
        float_arr = np.array(current_data.data).view(float)
        float_arr.tofile(f)
This works great, but I need to extend it to two dimensions. I'm starting to look at h5py as an option, but was hoping there is a simpler solution.
What would be great is something like
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
with open("bigMatrix", 'ab') as f:
    current_data = input_data.readFrame() # This loads the next frame in the file
    index = current_data.header
    float_arr = np.array(current_data.data).view(float)
    float_arr.tofile(f, index)
Any help is appreciated. I thought this would be a more common use-case to read and write to a 2D binary file in append mode.
You have two problems: one is that a file contains sequential data, and the other is that numpy binary files don't store shape information.
A simple way to start solving this would be to carry through with your initial idea of converting the data into files by header, then combining all the binary files into one large product (if you still feel the need to do so).
You could maintain a map of the headers you've found so far to their output files, data size, etc. This will allow you to combine the data more intelligently, if for example, there are missing chunks or headers or something.
from contextlib import ExitStack
from os import remove
from tempfile import NamedTemporaryFile
from shutil import copyfileobj
import sys

import numpy as np

class Header:
    __slots__ = ('id', 'count', 'file', 'name')

    def __init__(self, id):
        self.id = id
        self.count = 0
        self.file = NamedTemporaryFile(delete=False)
        self.name = self.file.name

    def write_frame(self, frame):
        data = np.array(frame.data).view(float)
        self.count += data.size
        data.tofile(self.file)

input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
file_map = {}
with ExitStack() as stack:
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break # recast this loop as necessary
        if frame.header not in file_map:
            header = Header(frame.header)
            stack.enter_context(header.file)
            file_map[frame.header] = header
        else:
            header = file_map[frame.header]
        header.write_frame(frame)

# Headers are assumed to be non-negative integers in 0..max(file_map)
num_headers = max(file_map) + 1
max_count = max(h.count for h in file_map.values())
with open('singleFrameHeader', 'wb') as output:
    output.write(num_headers.to_bytes(8, sys.byteorder))
    output.write(max_count.to_bytes(8, sys.byteorder))
    for i in range(num_headers):
        if i in file_map:
            h = file_map[i]
            with open(h.name, 'rb') as input:
                copyfileobj(input, output)
            remove(h.name)
            # Pad short rows with NaN so every row has max_count elements
            if h.count < max_count:
                np.full(max_count - h.count, np.nan, dtype=float).tofile(output)
        else:
            # Header never seen: write a full row of NaN
            np.full(max_count, np.nan, dtype=float).tofile(output)
The first 16 bytes will be the int64 number of headers and number of elements per header, respectively. Keep in mind that the file is in native byte order, whatever that may be, and is therefore not portable.
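Reading such a file back could look like the sketch below. It assumes the layout described above (two native-order 8-byte integers, then float64 rows) and writes a tiny file of its own so that it is self-contained; the file name and the 3 x 5 matrix are placeholders.

```python
import os
import struct
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "matrix.bin")

# Write a small file in the assumed layout: 16-byte prefix, then row-major float64 data.
n_rows, n_cols = 3, 5
matrix = np.arange(n_rows * n_cols, dtype=np.float64).reshape(n_rows, n_cols)
with open(path, "wb") as f:
    f.write(struct.pack("=qq", n_rows, n_cols))  # "=" is native byte order, "q" is int64
    matrix.tofile(f)

# Read the prefix to recover the shape.
with open(path, "rb") as f:
    rows, cols = struct.unpack("=qq", f.read(16))

# Map the data region without loading it; offset=16 skips the prefix.
loaded = np.memmap(path, dtype=np.float64, mode="r", offset=16, shape=(rows, cols))
first_row = np.array(loaded[0])
```

Because the prefix restores the shape, individual rows can be pulled out with ordinary indexing while the rest of the file stays on disk.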
Alternative
If (and only if) you know the exact size of a header dataset ahead of time, you can do this in one pass, with no temporary files. It also helps if the headers are contiguous. Otherwise, missing swaths will be zero-filled. You will still need to maintain a dictionary of your current position within a header, but you will no longer have to keep a separate file pointer around for each one. All-in-all, this is a much better alternative than the original solution, if your use-case allows it:
header_size = 500 * N # You must know this up front
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
header_map = {}
with open('singleFrameHeader', 'wb') as output:
    # max_header and max_count must also be known up front in this variant
    output.write(max_header.to_bytes(8, sys.byteorder))
    output.write(max_count.to_bytes(8, sys.byteorder))
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break
        if frame.header not in header_map:
            header_map[frame.header] = 0
        data = np.array(frame.data).view(float)
        # seek() takes a byte offset, so header_size must be expressed in bytes
        output.seek(16 + frame.header * header_size + header_map[frame.header])
        data.tofile(output)
        header_map[frame.header] += data.size * data.dtype.itemsize
I asked a question regarding this sort of out-of-order write pattern as a consequence of this answer: What happens when you seek past the end of a file opened for writing?
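The out-of-order write pattern can be tried on a tiny file. This sketch only demonstrates the behaviour I would expect on common platforms, where the gap left by seeking past the end reads back as zero bytes; the path is a throwaway temporary file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sparse.bin")

# Seek 8 bytes past the start of an empty file, then write two bytes.
with open(path, "wb") as f:
    f.seek(8)
    f.write(b"\x01\x02")

# The skipped region exists in the file and reads back as zeros.
with open(path, "rb") as f:
    content = f.read()

print(len(content))  # 10
print(content)
```

This is why headers that never appear in the input end up zero-filled rather than NaN-filled in the one-pass variant.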
I have a huge dictionary with numpy arrays as values which consumes almost all RAM. There is no possibility to pickle or compress it entirely. I've checked some solutions for reading/writing in chunks using zlib, but they work with files, StringIO, etc., whereas I want to read/write from/into RAM.
Here is the closest example to what I want, but it has only writing part. How can I read the object after saving this way, because chunks were written together and compressed chunks of course have different length?
import zlib

class ZlibWrapper():
    # chunksize is used to save memory, otherwise huge object will be copied
    def __init__(self, filename, chunksize=268435456): # 256 MB
        self.filename = filename
        self.chunksize = chunksize

    def save(self, data):
        """Saves a compressed object to disk
        """
        mdata = memoryview(data)
        with open(self.filename, 'wb') as f:
            for i in range(0, len(mdata), self.chunksize):
                mychunk = zlib.compress(bytes(mdata[i:i+self.chunksize]))
                f.write(mychunk)

    def load(self):
        # ???
        return data
Uncompressed objects unfortunately would be too huge to be sent over network, and zipping them externally would create additional complications.
Pickle unfortunately starts to consume RAM and system hangs.
Following the discussion with Charles Duffy, here is my attempt of serialization (does not work at the moment - does not even compress the strings):
import zlib
import json
import numpy as np
mydict = {"a":np.array([1,2,3]),"b":np.array([4,5,6]),"c":np.array([0,0,0])}
#------------
# write to compressed stream ---------------------
def string_stream_serialization(dic):
    for key, val in dic.items():
        #key_encoded = key.encode("utf-8") # is not json serializable
        yield json.dumps([key,val.tolist()])

output = ""
compressor = zlib.compressobj()
decompressor = zlib.decompressobj()
stream = string_stream_serialization(mydict)
with open("outfile.compressed", "wb") as f:
    for s in stream:
        if not s:
            f.write(compressor.flush())
            break
        f.write(compressor.compress(s.encode('utf-8'))) # .encode('utf-8') converts to bytes

# read from compressed stream: --------------------
def read_in_chunks(file_object, chunk_size=1024): # I set another chunk size intentionally
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

reconstructed = {}
with open("outfile.compressed", "rb") as f:
    for s in read_in_chunks(f):
        data = decompressor.decompress(decompressor.unconsumed_tail + s)
        while data:
            arr = json.loads(data.decode("utf-8"))
            reconstructed[arr[0]] = np.array(arr[1])
            data = decompressor.decompress(decompressor.unconsumed_tail)
print(reconstructed)
Your first focus should be on having a sane way to serialize and deserialize your data. We have several constraints about your data provided in the question itself, or in comments on same:
Your data consists of a dictionary with a very large number of key/value pairs
All keys are unicode strings
All values are numpy arrays which are individually short enough to easily fit in memory at any given time (or even to allow multiple copies of any single value), although in aggregate the storage required becomes extremely large.
This suggests a fairly simple implementation:
import io
import struct

import numpy

def serialize(f, content):
    for k,v in content.items():
        # write length of key, followed by key as string
        k_bstr = k.encode('utf-8')
        f.write(struct.pack('L', len(k_bstr)))
        f.write(k_bstr)
        # write length of value, followed by value in numpy.save format
        memfile = io.BytesIO()
        numpy.save(memfile, v)
        f.write(struct.pack('L', memfile.tell()))
        f.write(memfile.getvalue())

def deserialize(f):
    retval = {}
    while True:
        content = f.read(struct.calcsize('L'))
        if not content: break
        k_len = struct.unpack('L', content)[0]
        k_bstr = f.read(k_len)
        k = k_bstr.decode('utf-8')
        v_len = struct.unpack('L', f.read(struct.calcsize('L')))[0]
        v_bytes = io.BytesIO(f.read(v_len))
        v = numpy.load(v_bytes)
        retval[k] = v
    return retval
As a simple test:
test_file = io.BytesIO()
serialize(test_file, {
    "First Key": numpy.array([123,234,345]),
    "Second Key": numpy.array([321,432,543]),
})
test_file.seek(0)
print(deserialize(test_file))
...so, we've got that -- now, how do we add compression? Easily.
import gzip

with gzip.open('filename.gz', 'wb') as gzip_file:
    serialize(gzip_file, your_data)
...or, on the decompression side:
with gzip.open('filename.gz', 'rb') as gzip_file:
    your_data = deserialize(gzip_file)
This works because the gzip library already streams data out as it's requested, rather than compressing it or decompressing it all at once. There's no need to do windowing and chunking yourself -- just leave it to the lower layer.
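If you would rather keep the chunked zlib.compress writer from the question, the concatenated streams can still be separated on read: a decompressobj stops at the end of the first stream and leaves everything after it in unused_data. The sketch below works on an in-memory bytes object for brevity; the function names and the small payload are illustrative only.

```python
import zlib

def write_chunks(data, chunksize):
    """Compress data in independent zlib streams and concatenate them."""
    view = memoryview(data)
    return b"".join(
        zlib.compress(bytes(view[i:i + chunksize]))
        for i in range(0, len(view), chunksize)
    )

def iter_chunks(blob):
    """Yield the decompressed content of each concatenated zlib stream."""
    while blob:
        decomp = zlib.decompressobj()
        # Decompresses exactly one stream; trailing bytes are not consumed.
        yield decomp.decompress(blob)
        # Whatever followed that stream is the start of the next one.
        blob = decomp.unused_data

payload = bytes(range(256)) * 40  # 10240 bytes of sample data
blob = write_chunks(payload, chunksize=1000)
chunks = list(iter_chunks(blob))
restored = b"".join(chunks)
```

Each pass copies the remaining bytes into unused_data, so for many chunks a length prefix per chunk, as in serialize() above, is likely to scale better.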
To write a dictionary to disk, the zipfile module is a good fit.
When saving - Save each chunk as a file in the zip.
When loading - Iterate over the files in the zip and rebuild the data.
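A minimal sketch of the zipfile approach, assuming every key is a string that is also usable as an archive member name. The save_dict and load_dict helpers are hypothetical names, and an in-memory BytesIO stands in for the file on disk.

```python
import io
import zipfile

import numpy as np

def save_dict(target, data):
    """Write each array as its own compressed .npy member of a zip archive."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for key, arr in data.items():
            buf = io.BytesIO()
            np.save(buf, arr)  # one array at a time, so memory use stays small
            zf.writestr(key + ".npy", buf.getvalue())

def load_dict(source):
    """Rebuild the dictionary by iterating over the archive members."""
    result = {}
    with zipfile.ZipFile(source, "r") as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                result[name[:-len(".npy")]] = np.load(io.BytesIO(member.read()))
    return result

archive = io.BytesIO()
original = {"a": np.array([1, 2, 3]), "b": np.arange(6.0).reshape(2, 3)}
save_dict(archive, original)
archive.seek(0)
restored = load_dict(archive)
```

This is essentially the layout that numpy.savez_compressed produces, which may be a simpler option when all the arrays can be passed in a single call.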
I made a pickle file, storing a grayscale value of each pixel in 100,000 80x80 sized images.
(Plus an array of 100,000 integers whose values are one-digit).
My approximation for the total size of the pickle is,
4 byte x 80 x 80 x 100000 = 2.88 GB
plus the array of integers, which shouldn't be that large.
The generated pickle file however is over 16GB, so it's taking hours just to unpickle it and load it, and it eventually freezes, after it takes full memory resources.
Is there something wrong with my calculation or is it the way I pickled it?
I pickled the file in the following way.
from PIL import Image
import pickle
import os
import numpy
import time
trainpixels = numpy.empty([80000,6400])
trainlabels = numpy.empty(80000)
validpixels = numpy.empty([10000,6400])
validlabels = numpy.empty(10000)
testpixels = numpy.empty([10408,6400])
testlabels = numpy.empty(10408)
i=0
tr=0
va=0
te=0
for (root, dirs, filenames) in os.walk(indir1):
    print 'hello'
    for f in filenames:
        try:
            im = Image.open(os.path.join(root,f))
            Imv=im.load()
            x,y=im.size
            pixelv = numpy.empty(6400)
            ind=0
            for ii in range(x):
                for j in range(y):
                    temp=float(Imv[j,ii])
                    temp=float(temp/255.0)
                    pixelv[ind]=temp
                    ind+=1
            if i<40000:
                trainpixels[tr]=pixelv
                tr+=1
            elif i<45000:
                validpixels[va]=pixelv
                va+=1
            else:
                testpixels[te]=pixelv
                te+=1
            print str(i)+'\t'+str(f)
            i+=1
        except IOError:
            continue
trainimage=(trainpixels,trainlabels)
validimage=(validpixels,validlabels)
testimage=(testpixels,testlabels)
output=open('data.pkl','wb')
pickle.dump(trainimage,output)
pickle.dump(validimage,output)
pickle.dump(testimage,output)
Please let me know if you see something wrong with either my calculation or my code!
Python Pickles are not a thrifty mechanism for storing data as you're storing objects instead of "just the data."
The following test case takes 24kb on my system and this is for a small, sparsely populated numpy array stored in a pickle:
import os
import sys
import numpy
import pickle
testlabels = numpy.empty(1000)
testlabels[0] = 1
testlabels[99] = 0
test_labels_size = sys.getsizeof(testlabels) #80
with open('/tmp/pickle', 'wb') as output:
    pickle.dump(testlabels, output)
# The file is closed (and flushed) before its size is measured
print(os.path.getsize('/tmp/pickle'))
Further, I'm not sure why you believe 4 bytes to be the size of a number in Python: non-numpy ints are 24 bytes (sys.getsizeof(1)) and numpy arrays are a minimum of 80 bytes (sys.getsizeof(numpy.array([0], float))).
As you stated as a response to my comment, you have reasons for staying with Pickle, so I won't try to convince you further to not store objects, but be aware of the overhead of storing objects.
As an option: reduce the size of your training data, or pickle fewer objects.
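Two further factors may be worth checking, offered here as a hedged sketch and not as a diagnosis of the asker's actual file. First, numpy.empty defaults to float64, so each pixel takes 8 bytes, not 4; for the 100,408 rows of 6,400 values allocated in the question that is already about 5.1 GB. Second, the text-based pickle protocol 0 (the default in Python 2's pickle module) can inflate binary data considerably. The small array below is a stand-in for the real data.

```python
import pickle

import numpy as np

n_values = 80 * 80 * 10  # ten flattened 80x80 images

as_float64 = np.zeros(n_values)                    # NumPy's default dtype
as_float32 = np.zeros(n_values, dtype=np.float32)  # explicit 4-byte floats

print(as_float64.nbytes)  # 8 bytes per value
print(as_float32.nbytes)  # 4 bytes per value

# Compare the text protocol with the newest binary protocol.
text_pickle = pickle.dumps(as_float32, protocol=0)
binary_pickle = pickle.dumps(as_float32, protocol=pickle.HIGHEST_PROTOCOL)
print(len(text_pickle), len(binary_pickle))
```

Passing dtype=numpy.float32 when allocating the arrays and an explicit binary protocol to pickle.dump should bring the file much closer to the estimate in the question.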
There are several similar questions but none of them answers this simple question directly:
How can I capture a command's output and stream that content into numpy arrays without creating a temporary string object to read from?
So, what I would like to do is this:
import subprocess
import numpy
import StringIO
def parse_header(fileobject):
    # this function moves the filepointer and returns a dictionary
    d = do_some_parsing(fileobject)
    return d
sio = StringIO.StringIO(subprocess.check_output(cmd))
d = parse_header(sio)
# now the file pointer is at the start of data, parse_header takes care of that.
# ALL of the data is now available in the next line of sio
dt = numpy.dtype([(key, 'f8') for key in d.keys()])
# i don't know how do make this work:
data = numpy.fromxxxx(sio , dt)
# if i would do this, I create another copy besides the StringIO object, don't I?
# so this works, but isn't this 'bad' ?
datastring = sio.read()
data = numpy.fromstring(datastring, dtype=dt)
I tried it with StringIO and cStringIO but both are not accepted by numpy.frombuffer and numpy.fromfile.
Using StringIO object I first have to read the stream into a string and then use numpy.fromstring, but I would like to avoid creating the intermediate object (several Gigabytes).
An alternative for me would be if I can stream sys.stdin into numpy arrays, but that does not work with numpy.fromfile either (seek needs to be implemented).
Are there any work-arounds for this? I can't be the first one trying this (unless this is a PEBKAC case?)
Solution:
This is the current solution, it's a mix of unutbu's instruction how to use the Popen with PIPE and the hint of eryksun to use bytearray, so I don't know who to accept!? :S
proc = sp.Popen(cmd, stdout = sp.PIPE, shell=True)
d = parse_des_header(proc.stdout)
rec_dtype = np.dtype([(key,'f8') for key in d.keys()])
data = bytearray(proc.stdout.read())
ndata = np.frombuffer(data, dtype = rec_dtype)
I didn't check whether this really avoids creating another copy of the data (I don't know how to check). What I did notice is that it works much faster than everything I tried before, so many thanks to the authors of both answers!
Update 2022:
I just tried above solution steps without the bytearray() step and it just works fine. Thanks to Python 3 I guess?
You can use Popen with stdout=subprocess.PIPE. Read in the header, then load the rest into a bytearray to use with np.frombuffer.
Additional comments based on your edit:
If you're going to call proc.stdout.read(), it's equivalent to using check_output(). Both create a temporary string. If you preallocate data, you could use proc.stdout.readinto(data). Then if the number of bytes read into data is less than len(data), free the excess memory, else extend data by whatever is left to be read.
data = bytearray(2**32) # 4 GiB
n = proc.stdout.readinto(data)
if n < len(data):
    del data[n:] # free the unused tail
else:
    data += proc.stdout.read()
You could also come at this starting with a pre-allocated ndarray ndata and use buf = np.getbuffer(ndata) (Python 2 only; on Python 3, memoryview(ndata) serves the same purpose). Then readinto(buf) as above.
Here's an example to show that the memory is shared between the bytearray and the np.ndarray:
>>> data = bytearray(b'\x01')
>>> ndata = np.frombuffer(data, np.int8)
>>> ndata
array([1], dtype=int8)
>>> ndata[0] = 2
>>> data
bytearray(b'\x02')
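Putting the pieces together on Python 3, an end-to-end version might look like the sketch below. The child process is a stand-in for the real command: it prints a one-line header followed by six little-endian doubles. The 1024-byte buffer is an arbitrary size chosen for this small example.

```python
import subprocess
import sys

import numpy as np

# Stand-in producer: a header line, then raw little-endian float64 values.
child = (
    "import struct, sys\n"
    "sys.stdout.buffer.write(b'a b\\n')\n"
    "sys.stdout.buffer.write(struct.pack('<6d', 0, 1, 2, 3, 4, 5))\n"
)
proc = subprocess.Popen([sys.executable, "-c", child], stdout=subprocess.PIPE)

# Parse the header first; this advances the pipe to the start of the data.
names = proc.stdout.readline().split()
rec_dtype = np.dtype([(name.decode(), "<f8") for name in names])

# Read the payload straight into a preallocated buffer, then trim the unused tail.
data = bytearray(1024)
n = proc.stdout.readinto(data)
del data[n:]
proc.wait()

# frombuffer shares memory with the bytearray; no further copy is made.
ndata = np.frombuffer(data, dtype=rec_dtype)
```

The trim has to happen before np.frombuffer is called, because a bytearray cannot be resized while an array is viewing its memory.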
Since your data can easily fit in RAM, I think the easiest way to load the data into a numpy array is to use a ramfs.
On Linux,
sudo mkdir /mnt/ramfs
sudo mount -t ramfs -o size=5G ramfs /mnt/ramfs
sudo chmod 777 /mnt/ramfs
Then, for example, if this is the producer of the binary data:
writer.py:
from __future__ import print_function
import random
import struct
N = random.randrange(100)
print('a b')
for i in range(2*N):
    print(struct.pack('<d',random.random()), end = '')
Then you could load it into a numpy array like this:
reader.py:
import subprocess
import numpy
def parse_header(f):
    # this function moves the filepointer and returns a dictionary
    header = f.readline()
    d = dict.fromkeys(header.split())
    return d

filename = '/mnt/ramfs/data.out'
with open(filename, 'w') as f:
    cmd = 'writer.py'
    proc = subprocess.Popen([cmd], stdout = f)
    proc.communicate()
with open(filename, 'r') as f:
    header = parse_header(f)
    dt = numpy.dtype([(key, 'f8') for key in header.keys()])
    data = numpy.fromfile(f, dt)