How to load Pickle file in chunks? - python

Is there any option to load a pickle file in chunks?
I know we can save the data in CSV and load it in chunks.
But other than CSV, is there any option to load a pickle file or any python native file in chunks?

Based on the documentation for Python pickle, there is currently no support for chunking.
However, it is possible to split the data into chunks and then read it back in chunks. For example, suppose the original code is:
import pickle
filename = "myfile.pkl"
str_to_save = "myname"
with open(filename, 'wb') as file_handle:
    pickle.dump(str_to_save, file_handle)
with open(filename, 'rb') as file_handle:
    result = pickle.load(file_handle)
print(result)
That could be split into two separate pickle files:
import pickle
filename_1 = "myfile_1.pkl"
filename_2 = "myfile_2.pkl"
str_to_save = "myname"
with open(filename_1, 'wb') as file_handle:
    pickle.dump(str_to_save[0:4], file_handle)
with open(filename_2, 'wb') as file_handle:
    pickle.dump(str_to_save[4:], file_handle)
with open(filename_1, 'rb') as file_handle:
    result = pickle.load(file_handle)
print(result)
As per AKX's comment, writing multiple objects to a single file also works:
import pickle
filename = "myfile.pkl"
str_to_save = "myname"
with open(filename, 'wb') as file_handle:
    pickle.dump(str_to_save[0:4], file_handle)
    pickle.dump(str_to_save[4:], file_handle)
with open(filename, 'rb') as file_handle:
    result = pickle.load(file_handle)
    print(result)
    result = pickle.load(file_handle)
    print(result)
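If you don't know in advance how many objects were dumped into the file, a common pattern (a minimal sketch, reusing the file name from above, not part of the original snippet) is to keep calling pickle.load() until it raises EOFError:

import pickle

def load_all(filename):
    # Yield every object pickled into the file, one at a time,
    # so the whole file never has to be unpickled at once.
    with open(filename, 'rb') as file_handle:
        while True:
            try:
                yield pickle.load(file_handle)
            except EOFError:
                break

for chunk in load_all("myfile.pkl"):
    print(chunk)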

I had a similar issue, where I wrote a barrel file descriptor pool, and noticed that my pickle files were getting corrupt when I closed a file descriptor. Although you may do multiple dump() operations to an open file descriptor, it's not possible to subsequently do an open('file', 'ab') to start saving a new set of objects.
I got around this by doing a pickler.dump(None) as a session terminator right before I had to close the file descriptor, and upon re-opening, I instantiated a new Pickler instance to resume writing to the file.
When loading from this file, a None object signified an end-of-session, at which point I instantiated a new Unpickler instance with the file descriptor to continue reading the remainder of the multi-session pickle file.
This only applies if for some reason you have to close the file descriptor, though. Otherwise, any number of dump() calls can be performed for load() later.
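A minimal sketch of that session-terminator idea (the function names and structure here are illustrative, not the original pool code):

import pickle

def write_session(file_handle, objects):
    # One Pickler per session; a trailing None marks the end of the session.
    pickler = pickle.Pickler(file_handle)
    for obj in objects:
        pickler.dump(obj)
    pickler.dump(None)

def read_all_sessions(file_handle):
    # Keep creating fresh Unpickler instances, one per session,
    # until the underlying file is exhausted.
    while True:
        unpickler = pickle.Unpickler(file_handle)
        while True:
            try:
                obj = unpickler.load()
            except EOFError:
                return
            if obj is None:  # end of this session; restart with a new Unpickler
                break
            yield obj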

As far as I understand pickle, loading/dumping by chunk is not possible.
Pickle intrinsically reads a complete data stream in "chunks" of variable length, depending on flags within the data stream; that is what serialization is all about. The data stream itself could have been cut into chunks earlier (say, for a network transfer), but chunks cannot be pickled/unpickled "on the fly".
But something intermediate may be achievable with pickle's "buffers" and "out-of-band" features for very large data.
Note this is not exactly loading/saving a single pickle file in chunks. It only applies to objects encountered during serialization that declare themselves as being "out of band" (serialized separately).
Quoting the Pickler class doc:
If buffer_callback is not None, then it can be called any number of times with a buffer view. If the callback returns a false value (such as None), the given buffer is out-of-band; otherwise the buffer is serialized in-band, i.e. inside the pickle stream.
(emphasis mine)
Quoting the "Out of band" concept doc:
In some contexts, the pickle module is used to transfer massive amounts of data. Therefore, it can be important to minimize the number of memory copies, to preserve performance and resource consumption. However, normal operation of the pickle module, as it transforms a graph-like structure of objects into a sequential stream of bytes, intrinsically involves copying data to and from the pickle stream.
This constraint can be eschewed if both the provider (the implementation of the object types to be transferred) and the consumer (the implementation of the communications system) support the out-of-band transfer facilities provided by pickle protocol 5 and higher.
Example adapted from the documentation:
import pickle

b = ZeroCopyByteArray(b"abc")  # NB: class from the docs, with special __reduce_ex__ and _reconstruct methods
buffers = []
data = pickle.dumps(b, protocol=5, buffer_callback=buffers.append)
# We could do things with these buffers, like:
# - writing each to a separate file,
# - sending them over the network,
# ...
new_b = pickle.loads(data, buffers=buffers)  # load, supplying the out-of-band buffers
From this example, we could consider writing each buffer into a file, or sending each over a network. Then unpickling would be performed by loading those files (or network payloads) and passing them to the unpickler.
But note that we end up with two pieces of serialized data in this example: data and buffers. That is not really what the OP wants, and not exactly pickle load/dump by chunks.
From a pickle-to-a-single-file perspective, I don't think this gives any benefit, because we would have to define a custom way to pack both data and buffers into one file, i.e. define a new data format, which feels like it ruins pickle's initial benefits.
Quoting Unpickler constructor doc:
If buffers is not None, it should be an iterable of buffer-enabled objects that is consumed each time the pickle stream references an out-of-band buffer view. Such buffers have been given in order to the buffer_callback of a Pickler object.
Changed in version 3.8: The buffers argument was added.
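To make the file-per-buffer idea above concrete, here is a minimal sketch (ZeroCopyByteArray is assumed to be the class from the documentation example; the file names are illustrative):

import pickle

obj = ZeroCopyByteArray(b"abc")

buffers = []
data = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)

# Persist the main pickle stream and each out-of-band buffer to its own file.
with open("main.pkl", "wb") as f:
    f.write(data)
for i, buf in enumerate(buffers):
    with open("buffer_%d.bin" % i, "wb") as f:
        f.write(buf.raw())  # PickleBuffer.raw() exposes the underlying bytes

# Later: read everything back and pass the buffers, in the same order, to loads().
with open("main.pkl", "rb") as f:
    data = f.read()
loaded_buffers = []
for i in range(len(buffers)):  # real code would record how many buffers were written
    with open("buffer_%d.bin" % i, "rb") as f:
        loaded_buffers.append(f.read())
new_obj = pickle.loads(data, buffers=loaded_buffers)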

Related

Convert bytes to a file object in python

I have a small application that reads local files using:
with open(diefile_path, 'r') as csv_file:
with open(diefile_path, 'r') as file:
and also uses linecache module
I need to expand this to files that are sent from a remote server.
The content received from the server is of type bytes.
I couldn't find much information about handling the BytesIO type, and I was wondering if there is a way to convert the bytes chunk to a file-like object.
My goal is to use the APIs specified above (open, linecache).
I was able to convert the bytes into a string using data.decode("utf-8"),
but I can't use the methods above (open and linecache).
a small example to illustrate
data = b'First line\nSecond line\nThird line\n'
with open(data) as file:
    line = file.readline()
    print(line)
output:
First line
Second line
Third line
can it be done?
open is used to open actual files, returning a file-like object. Here, you already have the data in memory, not in a file, so you can instantiate the file-like object directly.
import io

data = b'First line\nSecond line\nThird line\n'
file = io.StringIO(data.decode())
for line in file:
    print(line.strip())
However, if what you are getting is really just a newline-separated string, you can simply split it into a list directly.
lines = data.decode().strip().split('\n')
The main difference is that the StringIO version is slightly lazier; it has a smaller memory footprint compared to the list, as it splits strings off as requested by the iterator.
The answer above using StringIO needs to specify an encoding, which may cause a wrong conversion.
From the Python documentation, using BytesIO:
from io import BytesIO
f = BytesIO(b"some initial binary data: \x00\x01")
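To get line-by-line reading from raw bytes without a separate decode step, the BytesIO object can also be wrapped in an io.TextIOWrapper (a small sketch using the question's sample data):

import io

data = b'First line\nSecond line\nThird line\n'
file = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8')
for line in file:
    print(line.strip())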

Reading python pickle data before writing it to a file

Background:
Hi. I'm currently working on a project that mainly relies on Pickles to save the state of objects I have. Here is a code snippet of two functions I've written:
from Kiosk import *  # The main class used for the lockers
import gc            # Garbage collector library, used to get all instances of a class
import pickle        # Library used to store variables in files

def storeData(Lockers: Locker):
    with open('lockerData', 'wb') as File:
        pickle.dump(Lockers, File)

def readData():
    with open('lockerData', 'rb') as File:
        return pickle.load(File)
This pickle data will eventually be sent and received from a server using the Sockets library.
I've done some reading on the topic of Pickles and it seems like everyone agrees that pickles can be quite dangerous to use in some use cases as it's relatively easy to get them to execute unwanted code.
Objective:
For the above mentioned reasons I want to encrypt my pickle data with AES before writing it to the pickle file; that way the pickle file is always encrypted, even when sent to and received from the server. My main problem now is that I don't know how to get the pickle data without writing it to the pickle file first. pickle.dump() only allows me to write the pickle data to a file but doesn't give me that pickle data straight away.
If I decide to do encryption after the pickle data has already been written to the file that would mean that there would be a period of time where the pickle data is stored in plain text, and I don't want that to happen.
Pseudocode:
Here is how I'm expecting the task execution to flow:
PickleData = createPickle(Lockers)
PickleDataE = encrypt(PickleData)
with open('textfile.txt', 'wb') as File:
    File.write(PickleDataE)
Question:
So my question is, how can I get the pickle data without writing it to a file?
You can store the encrypted data itself in the file. When you read the encrypted data back, you decrypt it into a variable. If you wrap that variable in an io.BytesIO object, you can read it just like you would from a file, except it's in memory now. If you give that a try, I'm sure future questions can help with how to read the decrypted data as if it were pickle data.
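The missing piece in the question is that pickle.dumps() returns the pickled bytes directly, without touching a file. Here is a minimal sketch of the round trip, using the cryptography package's Fernet as one possible stand-in for the AES step (the encryption choice is an assumption, not part of the original code):

import pickle
from cryptography.fernet import Fernet  # assumed backend: pip install cryptography

key = Fernet.generate_key()  # in real code, store and manage this key securely
fernet = Fernet(key)

def storeData(lockers):
    pickled = pickle.dumps(lockers)  # pickled bytes, never written out in plain text
    with open('lockerData', 'wb') as f:
        f.write(fernet.encrypt(pickled))

def readData():
    with open('lockerData', 'rb') as f:
        pickled = fernet.decrypt(f.read())
    return pickle.loads(pickled)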

Pickle problem writing to file

I have a problem writing a file with Pickle in Python
Here is my code:
test = "TEST"
f1 = open(path+filename, "wb", 0)
pickle.dump(test,f1,0)
f1.close()
return
This gives me the output in the .txt file as VTESTp0, and I'm not sure why.
Shouldn't it just have been saved as TEST?
I'm very new to pickle and I didn't even know it existed until today so sorry if I'm asking a silly question.
No, pickle does not write strings just as strings. Pickle is a serialization protocol, it turns objects into strings of bytes so that you can later recreate them. The actual format depends on which version of the protocol you use, but you should really treat pickle data as an opaque type.
If you want to write the string "TEST" to the file, just write the string itself. Don't bother with pickle.
Think of pickling as saving binary data to disk. This is interesting if you have data structures in your program like a big dict or array, which took some time to create. You can save them to a file with pickle and read them in with pickle the next time your program runs, thus saving you the time it took to build the data structure. The downside is that other, non-Python programs will not be able to understand the pickle files.
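For example, a minimal sketch of that save/restore pattern for a dict (the file name is illustrative):

import pickle

cache = {"answer": 42, "names": ["alice", "bob"]}  # something that was expensive to build

with open("cache.pkl", "wb") as f:
    pickle.dump(cache, f)

# On a later run, restore it instead of rebuilding it:
with open("cache.pkl", "rb") as f:
    cache = pickle.load(f)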
As pickle is quite versatile you can of course also write simple text strings to a pickle file. But if you want to process them further, e.g. in a text editor or by another program, you need to store them verbatim, as Thomas Wouters suggests:
test = "TEST"
f1 = open(path+filename, "wb", 0)
f1.write(test)
f1.close()
return

How do I read binary pickle data first, then unpickle it?

I'm unpickling a NetworkX object that's about 1 GB in size on disk. Although I saved it in binary format (using protocol 2), it is taking a very long time to unpickle this file, at least half an hour.
I've read here that pickling can be sped up by first reading the entire file into memory, and then unpickling it (that particular thread refers to python 3.0, which I'm not using, but the point should still be true in python 2.6).
How do I first read the binary file, and then unpickle it? I have tried:
import cPickle as pickle
f = open("big_networkx_graph.pickle","rb")
bin_data = f.read()
graph_data = pickle.load(bin_data)
But this returns:
TypeError: argument must have 'read' and 'readline' attributes
Any ideas?
pickle.load(file) expects a file-like object. Instead, use:
pickle.loads(string)
Read a pickled object hierarchy from a string. Characters in the string past the pickled object’s representation are ignored.
The documentation mentions StringIO, which I think is one possible solution.
Try:
f = open("big_networkx_graph.pickle","rb")
bin_data = f.read()
sio = StringIO(bin_data)
graph_data = pickle.load(sio)
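On Python 3, the same idea uses io.BytesIO, or simply pickle.loads on the bytes (a short sketch):

import io
import pickle

with open("big_networkx_graph.pickle", "rb") as f:
    bin_data = f.read()

# Either wrap the bytes in a file-like object...
graph_data = pickle.load(io.BytesIO(bin_data))
# ...or skip the wrapper entirely:
graph_data = pickle.loads(bin_data)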

memory use in large data-structures manipulation/processing

I have a number of large (~100 MB) files which I'm regularly processing. Although I try to delete unneeded data structures during processing, memory consumption is a bit too high. I was wondering if there is a way to efficiently manipulate large data, e.g.:
def read(self, filename):
    fc = read_100_mb_file(filename)
    self.process(fc)

def process(self, content):
    # do some processing of file content
    pass
Is there a duplication of data structures? Isn't it more memory efficient to use a class-wide attribute like self.fc?
When should I use garbage collection? I know about the gc module, but do I call it after I del fc for example?
Update:
P.S. 100 MB is not a problem in itself, but float conversion and further processing add significantly more to both the working set and the virtual size (I'm on Windows).
I'd suggest looking at the presentation by David Beazley on using generators in Python. This technique allows you to handle a lot of data, and do complex processing, quickly and without blowing up your memory use. IMO, the trick isn't holding a huge amount of data in memory as efficiently as possible; the trick is avoiding loading a huge amount of data into memory at the same time.
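As a small illustration of that generator style (a minimal sketch, not taken from the presentation), each stage consumes and yields items lazily, so only one line and one parsed record are in memory at a time:

def read_lines(filename):
    # Yield one line at a time instead of reading the whole file.
    with open(filename) as f:
        for line in f:
            yield line

def parse_records(lines):
    # Turn each comma-separated line into a list of floats, lazily.
    for line in lines:
        yield [float(field) for field in line.split(",")]

# Stages are chained; nothing is materialized until we iterate.
total = sum(sum(record) for record in parse_records(read_lines("data.in")))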
Before you start tearing your hair out over the garbage collector, you might be able to avoid that 100mb hit of loading the entire file into memory by using a memory-mapped file object. See the mmap module.
Don't read the entire 100 meg file in at a time. Use streams to process a little bit at a time. Check out this blog post that talks about handling large csv and xml files. http://lethain.com/entry/2009/jan/22/handling-very-large-csv-and-xml-files-in-python/
Here is a sample of the code from the article.
from __future__ import with_statement  # for python 2.5

with open('data.in', 'r') as fin:
    with open('data.out', 'w') as fout:
        for line in fin:
            fout.write(','.join(line.split(' ')))
So, from your comments I assume that your file looks something like this:
item1,item2,item3,item4,item5,item6,item7,...,itemn
which you reduce to a single value by repeated application of some combination function. As a solution, only read a single value at a time:
def read_values(f):
    buf = []
    while True:
        c = f.read(1)
        if c == ",":
            yield parse("".join(buf))
            buf = []
        elif c == "":
            yield parse("".join(buf))
            return
        else:
            buf.append(c)

with open("some_file", "r") as f:
    agg = initial
    for v in read_values(f):
        agg = combine(agg, v)
This way, memory consumption stays constant, unless agg grows in time.
Provide appropriate implementations of initial, parse and combine. Also, in real code, don't read the file byte-by-byte; read into a fixed-size buffer, parse from the buffer, and read more as you need it.
This is basically what the builtin reduce function does, but I've used an explicit for loop here for clarity. Here's the same thing using reduce:
with open("some_file", "r") as f:
agg = reduce(combine, read_values(f), initial)
I hope I interpreted your problem correctly.
First of all, don't touch the garbage collector. That's not the problem, nor the solution.
It sounds like the real problem you're having is not with the file reading at all, but with the data structures that you're allocating as you process the files.
Consider using del to remove structures that you no longer need during processing. Also, you might consider using marshal to dump some of the processed data to disk while you work through the next 100 MB of input files.
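For reference, a tiny sketch of the marshal idea (the file name is illustrative; note marshal only handles simple built-in types such as numbers, strings, lists and dicts):

import marshal

partial_results = {"file_1": [1.0, 2.5, 3.75]}  # some intermediate data

# Park the intermediate results on disk while processing the next file.
with open("partial.marshal", "wb") as f:
    marshal.dump(partial_results, f)

# Later, pull them back in.
with open("partial.marshal", "rb") as f:
    partial_results = marshal.load(f)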
For file reading, you have basically two options: unix-style files as streams, or memory mapped files. For streams-based files, the default python file object is already buffered, so the simplest code is also probably the most efficient:
with open("filename", "r") as f:
for line in f:
# do something with a line of the files
Alternately, you can use f.read([size]) to read blocks of the file. However, usually you do this to gain CPU performance, by multithreading the processing part of your script, so that you can read and process at the same time. But it doesn't help with memory usage; in fact, it uses more memory.
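A short sketch of that block-reading pattern (the 64 KB block size is an arbitrary choice):

with open("filename", "r") as f:
    while True:
        block = f.read(64 * 1024)  # read a fixed-size block
        if not block:
            break
        # process(block) -- hand each block to your processing code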
The other option is mmap, which looks like this:
with open("filename", "r+") as f:
map = mmap.mmap(f.fileno(), 0)
line = map.readline()
while line != '':
# process a line
line = map.readline()
This sometimes outperforms streams, but it also won't improve memory usage.
In your example code, data is being stored in the fc variable. If you don't keep a reference to fc around, your entire file contents will be removed from memory when the read method ends.
If they are not, then you are keeping a reference somewhere. Maybe the reference is being created in read_100_mb_file, maybe in process. If there is no reference, the CPython implementation will deallocate it almost immediately.
There are some tools to help you find where this reference is: guppy, dowser, pysizer...
