How do I read binary pickle data first, then unpickle it? - python

I'm unpickling a NetworkX object that's about 1GB in size on disk. Although I saved it in the binary format (using protocol 2), it is taking a very long time to unpickle this file---at least half an hour. The system I'm running on has plenty of system memory (128 GB), so that's not the bottleneck.
I've read here that pickling can be sped up by first reading the entire file into memory, and then unpickling it (that particular thread refers to python 3.0, which I'm not using, but the point should still be true in python 2.6).
How do I first read the binary file, and then unpickle it? I have tried:
import cPickle as pickle
f = open("big_networkx_graph.pickle","rb")
bin_data = f.read()
graph_data = pickle.load(bin_data)
But this returns:
TypeError: argument must have 'read' and 'readline' attributes
Any ideas?

pickle.load(file) expects a file-like object. Instead, use:
pickle.loads(string)
Read a pickled object hierarchy from a string. Characters in the string past the pickled object’s representation are ignored.
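A minimal sketch of the read-then-unpickle approach, written so it also runs on Python 3 (with the standard pickle module instead of cPickle). The small dict is a hypothetical stand-in for the real graph, and the temporary path is only there to make the example self-contained:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the large graph; any picklable object works.
graph = {"nodes": [1, 2, 3], "edges": [(1, 2), (2, 3)]}

# Write a pickle to a throwaway location so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "big_networkx_graph.pickle")
with open(path, "wb") as f:
    pickle.dump(graph, f, protocol=2)

# Read the whole file into memory in one call ...
with open(path, "rb") as f:
    bin_data = f.read()

# ... then unpickle from the in-memory bytes with loads(), not load().
graph_data = pickle.loads(bin_data)
```

Whether this is actually faster than pickle.load(f) depends on the Python version and the data, so it is worth timing both on the real file.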

The documentation mentions StringIO, which I think is one possible solution.
Try:
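from StringIO import StringIO  # Python 2 module; on Python 3 use io.BytesIO instead
import cPickle as pickle
f = open("big_networkx_graph.pickle","rb")
bin_data = f.read()
f.close()
sio = StringIO(bin_data)
graph_data = pickle.load(sio)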
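For readers on Python 3, where the StringIO module no longer exists, io.BytesIO plays the same role for binary data. A rough equivalent, using a small made-up dict in place of the graph:

```python
import io
import pickle

# Made-up sample data standing in for the bytes read from the file.
bin_data = pickle.dumps({"a": 1, "b": [2, 3]}, protocol=2)

# Wrap the bytes in a file-like object so pickle.load() can consume it.
sio = io.BytesIO(bin_data)
graph_data = pickle.load(sio)
```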

Related

Convert bytes to a file object in python

I have a small application that reads local files using:
open(diefile_path, 'r') as csv_file
open(diefile_path, 'r') as file
and also uses linecache module
I need to extend this to files that are sent from a remote server.
The content received from the server is of type bytes.
I couldn't find much information about handling the IOBytes type, and I was wondering whether there is a way to convert the bytes chunk to a file-like object.
My goal is to use the APIs specified above (open, linecache).
I was able to convert the bytes into a string using data.decode("utf-8"),
but I can't use the methods above (open and linecache)
a small example to illustrate
data = b'First line\nSecond line\nThird line\n'
with open(data) as file:
    line = file.readline()
    print(line)
output:
First line
Second line
Third line
can it be done?
open is used to open actual files, returning a file-like object. Here, you already have the data in memory, not in a file, so you can instantiate the file-like object directly.
import io
data = b'First line\nSecond line\nThird line\n'
file = io.StringIO(data.decode())
for line in file:
    print(line.strip())
However, if what you are getting is really just a newline-separated string, you can simply split it into a list directly.
lines = data.decode().strip().split('\n')
The main difference is that the StringIO version is slightly lazier; it has a smaller memory footprint than the list, because it splits off lines only as the iterator requests them.
The answer above uses StringIO, which requires decoding the bytes with a specific encoding; choosing the wrong one can corrupt the text.
From the Python documentation, using BytesIO:
from io import BytesIO
f = BytesIO(b"some initial binary data: \x00\x01")
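If text lines are still needed, one option is to keep the data in a BytesIO and wrap it in io.TextIOWrapper, so the encoding is stated explicitly in one place. This is only a sketch; utf-8 is an assumption about what the server sends:

```python
import io

data = b"First line\nSecond line\nThird line\n"

# Binary file-like object: readline() returns bytes, no decoding involved.
first_raw = io.BytesIO(data).readline()

# Text view over the same bytes; the encoding (assumed utf-8) is explicit.
text_file = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
lines = [line.rstrip("\n") for line in text_file]
```

Note that neither object has a path on disk, so APIs that take a filename, such as open and linecache, cannot be pointed at them directly.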

How to load Pickle file in chunks?

Is there any option to load a pickle file in chunks?
I know we can save the data in CSV and load it in chunks.
But other than CSV, is there any option to load a pickle file or any python native file in chunks?
Based on the documentation for Python pickle, there is not currently support for chunking.
However, it is possible to split data into chunks and then read in chunks. For example, suppose the original structure is
import pickle
filename = "myfile.pkl"
str_to_save = "myname"
with open(filename,'wb') as file_handle:
    pickle.dump(str_to_save, file_handle)
with open(filename,'rb') as file_handle:
    result = pickle.load(file_handle)
print(result)
That could be split into two separate pickle files:
import pickle
filename_1 = "myfile_1.pkl"
filename_2 = "myfile_2.pkl"
str_to_save = "myname"
with open(filename_1,'wb') as file_handle:
    pickle.dump(str_to_save[0:4], file_handle)
with open(filename_2,'wb') as file_handle:
    pickle.dump(str_to_save[4:], file_handle)
with open(filename_1,'rb') as file_handle:
    result = pickle.load(file_handle)
print(result)
As per AKX's comment, writing multiple data to a single file also works:
import pickle
filename = "myfile.pkl"
str_to_save = "myname"
with open(filename,'wb') as file_handle:
    pickle.dump(str_to_save[0:4], file_handle)
    pickle.dump(str_to_save[4:], file_handle)
with open(filename,'rb') as file_handle:
    result = pickle.load(file_handle)
    print(result)
    result = pickle.load(file_handle)
    print(result)
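The same idea can be generalized to an arbitrary number of chunks. The sketch below is one possible way to do it, not a built-in pickle feature: the helper names dump_in_chunks and load_chunks and the chunk size are made up for illustration. The reader stops when pickle.load raises EOFError at the end of the file.

```python
import os
import pickle
import tempfile

def dump_in_chunks(items, path, chunk_size):
    """Write a list as a sequence of independent pickles, one per slice."""
    with open(path, "wb") as file_handle:
        for start in range(0, len(items), chunk_size):
            pickle.dump(items[start:start + chunk_size], file_handle)

def load_chunks(path):
    """Yield one unpickled chunk at a time until the file is exhausted."""
    with open(path, "rb") as file_handle:
        while True:
            try:
                yield pickle.load(file_handle)
            except EOFError:  # raised when no pickle is left to read
                return

path = os.path.join(tempfile.mkdtemp(), "myfile.pkl")
dump_in_chunks(list(range(10)), path, chunk_size=4)
chunks = list(load_chunks(path))
```

Because load_chunks is a generator, a caller that iterates over it directly only holds one chunk in memory at a time.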
I had a similar issue, where I wrote a barrel file descriptor pool, and noticed that my pickle files were getting corrupt when I closed a file descriptor. Although you may do multiple dump() operations to an open file descriptor, it's not possible to subsequently do an open('file', 'ab') to start saving a new set of objects.
I got around this by doing a pickler.dump(None) as a session terminator right before I had to close the file descriptor, and upon re-opening, I instantiated a new Pickler instance to resume writing to the file.
When loading from this file, a None object signified an end-of-session, at which point I instantiated a new Unpickler instance with the file descriptor to continue reading the remainder of the multi-session pickle file.
This only applies if for some reason you have to close the file descriptor, though. Otherwise, any number of dump() calls can be performed for load() later.
As far as I understand Pickle, load/dump by chunk is not possible.
Pickle intrinsically reads a complete data stream in "chunks" of variable length, depending on flags within the data stream. That is what serialization is all about. The data stream itself could have been cut into chunks earlier (say, for network transfer), but those chunks cannot be pickled/unpickled "on the fly".
But maybe something intermediate can be achieved with pickle "buffers" and "out of band" features for very large data.
Note that this is not exactly loading or saving a single pickle file in chunks. It only applies to objects encountered during serialization that declare themselves as being "out of band" (serialized separately).
Quoting the Pickler class doc:
If buffer_callback is not None, then it can be called any number of times with a buffer view. If the callback returns a false value (such as None), the given buffer is out-of-band; otherwise the buffer is serialized in-band, i.e. inside the pickle stream.
(emphasis mine)
Quoting the "Out of band" concept doc:
In some contexts, the pickle module is used to transfer massive amounts of data. Therefore, it can be important to minimize the number of memory copies, to preserve performance and resource consumption. However, normal operation of the pickle module, as it transforms a graph-like structure of objects into a sequential stream of bytes, intrinsically involves copying data to and from the pickle stream.
This constraint can be eschewed if both the provider (the implementation of the object types to be transferred) and the consumer (the implementation of the communications system) support the out-of-band transfer facilities provided by pickle protocol 5 and higher.
Example taken from the documentation:
b = ZeroCopyByteArray(b"abc") # NB: class has a special __reduce_ex__ and _reconstruct method
buffers = []
data = pickle.dumps(b, protocol=5, buffer_callback=buffers.append)
# we could do things with these buffers like:
# - writing each to a single file,
# - sending them over network,
# ...
new_b = pickle.loads(data, buffers=buffers) # load in chunks
From this example, we could consider writing each buffer into a file, or sending each on a network. Then unpickling would be performed by loading those files (or network payloads) and passing to the unpickle.
But note that we end up with two pieces of serialized data in the example:
data
buffers
This is not really what the OP wants, and it is not exactly pickle load/dump by chunks.
From a pickle-to-a-single-file perspective, I don't think this gives any benefit, because we would have to define a custom method to pack both data and buffers into one file, i.e. define a new data format, which feels like giving up pickle's original benefits.
Quoting Unpickler constructor doc:
If buffers is not None, it should be an iterable of buffer-enabled objects that is consumed each time the pickle stream references an out-of-band buffer view. Such buffers have been given in order to the buffer_callback of a Pickler object.
Changed in version 3.8: The buffers argument was added.
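A self-contained variation on the documentation example, offered as a sketch rather than the documented recipe: instead of the ZeroCopyByteArray class it wraps a plain bytearray in pickle.PickleBuffer (Python 3.8+). The record dict, its keys, and the payload size are made up for illustration.

```python
import pickle

# Large binary payload that we want to keep out of the main pickle stream.
big = bytearray(b"x" * 1_000_000)
record = {"name": "sample", "payload": pickle.PickleBuffer(big)}

# buffers.append returns None (a false value), so each buffer goes out-of-band.
buffers = []
data = pickle.dumps(record, protocol=5, buffer_callback=buffers.append)

# 'data' now holds only the small structure; the payload lives in 'buffers'.
# Each buffer could be written to its own file or sent separately.
chunks = [buf.raw().tobytes() for buf in buffers]

# Unpickling needs both parts, with the buffers supplied in the same order.
restored = pickle.loads(data, buffers=chunks)
```

Here restored["payload"] is whatever object was supplied in chunks, so the two parts really are stored and transported independently.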

Convert file into BytesIO object using python

I have a file and want to convert it into BytesIO object so that it can be stored in database's varbinary column.
Please can anyone help me convert it using python.
Below is my code:
f = open(filepath, "rb")
print(f.read())
myBytesIO = io.BytesIO(f)
myBytesIO.seek(0)
print(type(myBytesIO))
Opening a file with open and mode read-binary already gives you a Binary I/O object.
Documentation:
The easiest way to create a binary stream is with open() with 'b' in the mode string:
f = open("myfile.jpg", "rb")
So in normal circumstances, you'd be fine just passing the file handle wherever you need to supply it. If you really want/need to get a BytesIO instance, just pass the bytes you've read from the file when creating your BytesIO instance like so:
from io import BytesIO
with open(filepath, "rb") as fh:
    buf = BytesIO(fh.read())
This has the disadvantage of loading the entire file into memory, which might be avoidable if the code you're passing the instance to is smart enough to stream the file without keeping it in memory. Note that the example uses open as a context manager that will reliably close the file, even in case of errors.
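For comparison with the code in the question, which passes the file object itself to io.BytesIO after the data has already been consumed by f.read(), here is a small end-to-end sketch. The temporary file and its contents are made up, and whether a given database driver wants a BytesIO or plain bytes for a varbinary column is an assumption to check against that driver's documentation.

```python
import io
import os
import tempfile

# Create a small throwaway binary file so the example is self-contained.
filepath = os.path.join(tempfile.mkdtemp(), "myfile.bin")
with open(filepath, "wb") as fh:
    fh.write(b"\x00\x01binary payload")

# Pass the bytes (not the file object) to BytesIO.
with open(filepath, "rb") as fh:
    buf = io.BytesIO(fh.read())

# getvalue() returns the full contents regardless of the current position.
raw_bytes = buf.getvalue()
```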

What is pickle doing?

I have used Python for years. I have used pickle extensively. I cannot figure out what this is doing:
with codecs.open("huge_picklefile.pc", "rb") as f:
    data = pickle.load(f)
    print(len(data))
    data = pickle.load(f)
    print(len(data))
    data = pickle.load(f)
    print(len(data))
This returns to me:
335
59
12
I am beyond confused. I am used to pickle loading the massive file into memory. The object itself is a massive array of arrays (I assume). Could it be comprised of multiple pickle objects? Unfortunately, I didn't create the pickle object and I don't have access to whoever did.
I cannot figure out why pickle is splitting up my file into chunks, which isn't the default, and I am not telling it to. What does reloading the same file do? I honestly never tried or even came across a use case until now.
I spent a good 5 hours trying to figure out how to even ask this question on Google. Unsurprisingly, trying "multiple pickle loads on the same document" doesn't yield anything too useful. The Python 3.7 pickle docs do not describe this behavior. I can't figure out how repeatedly loading a pickle document doesn't (a) crash or (b) load the entire thing into memory and then just reference itself. In my 15 years of using Python I have never run into this problem... so I am taking a leap of faith that this is just weird and we should probably just use a database instead.
This file is not quite a pickle file. Someone has dumped multiple pickles into the same file, resulting in the file contents being a concatenation of multiple pickles. When you call pickle.load(f), pickle will read the file from the current file position until it finds a pickle end, so each pickle.load call will load the next pickle.
You can create such a file yourself by calling pickle.dump repeatedly:
with open('demofile', 'wb') as f:
    pickle.dump([1, 2, 3], f)
    pickle.dump([10, 20], f)
    pickle.dump([0, 0, 0], f)
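To see the "current file position" behavior directly, a small sketch can record f.tell() after each load. The temporary path is only there to keep the example self-contained; the offsets themselves depend on the pickle protocol, so none are shown here.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demofile")
objects = [[1, 2, 3], [10, 20], [0, 0, 0]]

# Concatenate three independent pickles in one file.
with open(path, "wb") as f:
    for obj in objects:
        pickle.dump(obj, f)

loaded, offsets = [], []
with open(path, "rb") as f:
    for _ in objects:
        loaded.append(pickle.load(f))  # reads exactly one pickle
        offsets.append(f.tell())       # position where the next pickle starts

file_size = os.path.getsize(path)
```

Each load leaves the file positioned just past the pickle it consumed, and the last offset equals the file size. One more pickle.load(f) at that point would raise EOFError.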

Pickle problem writing to file

I have a problem writing a file with Pickle in Python
Here is my code:
test = "TEST"
f1 = open(path+filename, "wb", 0)
pickle.dump(test,f1,0)
f1.close()
return
This gives me the output in the .txt file as VTESTp0. I'm not sure why this is?
Shouldn't it just have been saved as TEST?
I'm very new to pickle and I didn't even know it existed until today so sorry if I'm asking a silly question.
No, pickle does not write strings just as strings. Pickle is a serialization protocol, it turns objects into strings of bytes so that you can later recreate them. The actual format depends on which version of the protocol you use, but you should really treat pickle data as an opaque type.
If you want to write the string "TEST" to the file, just write the string itself. Don't bother with pickle.
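A quick way to see the difference is to compare the two byte strings side by side. This sketch uses Python 3 and protocol 0, as in the question; the exact pickled bytes vary by protocol, so they are not reproduced here.

```python
import pickle

test = "TEST"

# Pickled form: the text plus opcodes that let pickle rebuild the object.
pickled = pickle.dumps(test, protocol=0)

# Plain form: just the encoded characters, readable by any program.
plain = test.encode("utf-8")

# Only the pickled form round-trips through pickle.loads().
restored = pickle.loads(pickled)
```

Printing pickled shows the extra framing around TEST, which is what ends up in the file when pickle.dump is used.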
Think of pickling as saving binary data to disk. This is interesting if you have data structures in your program like a big dict or array, which took some time to create. You can save them to a file with pickle and read them in with pickle the next time your program runs, thus saving you the time it took to build the data structure. The downside is that other, non-Python programs will not be able to understand the pickle files.
As pickle is quite versatile you can of course also write simple text strings to a pickle file. But if you want to process them further, e.g. in a text editor or by another program, you need to store them verbatim, as Thomas Wouters suggests:
test = "TEST"
f1 = open(path+filename, "wb", 0)
f1.write(test)
f1.close()
return
