I have used Python for years. I have used pickle extensively. I cannot figure out what this is doing:
with codecs.open("huge_picklefile.pc", "rb") as f:
    data = pickle.load(f)
    print(len(data))
    data = pickle.load(f)
    print(len(data))
    data = pickle.load(f)
    print(len(data))
This returns to me:
335
59
12
I am beyond confused. I am used to pickle loading the massive file into memory. The object itself is a massive array of arrays (I assume). Could it be composed of multiple pickle objects? Unfortunately, I didn't create the pickle file, and I don't have access to whoever did.
I cannot figure out why pickle is splitting up my file into chunks, which isn't the default, and I am not telling it to. What does reloading the same file do? I honestly never tried or even came across a use case until now.
I spent a good 5 hours trying to figure out how to even ask this question on Google. Unsurprisingly, trying "multiple pickle loads on the same document" doesn't yield anything too useful. The Python 3.7 pickle docs do not describe this behavior. I can't figure out how repeatedly loading a pickle document doesn't (a) crash or (b) load the entire thing into memory and then just reference itself. In my 15 years of using Python I have never run into this problem... so I am taking a leap of faith that this is just weird and we should probably just use a database instead.
This file is not quite a pickle file. Someone has dumped multiple pickles into the same file, so the file contents are a concatenation of multiple pickles. When you call pickle.load(f), pickle reads the file from the current file position until it reaches the end of one pickle, so each pickle.load call loads the next pickle.
You can create such a file yourself by calling pickle.dump repeatedly:
with open('demofile', 'wb') as f:
    pickle.dump([1, 2, 3], f)
    pickle.dump([10, 20], f)
    pickle.dump([0, 0, 0], f)
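To see the effect end to end, here is a minimal sketch that writes the three demo lists to a scratch file and reads them back with three separate pickle.load calls. The temporary directory is an assumption made only to keep the sketch self-contained.

```python
import os
import pickle
import tempfile

# Scratch location so the sketch does not touch real data (illustrative path).
path = os.path.join(tempfile.mkdtemp(), "demofile")

# Three independent pickles, written back to back into one file.
with open(path, "wb") as f:
    pickle.dump([1, 2, 3], f)
    pickle.dump([10, 20], f)
    pickle.dump([0, 0, 0], f)

# Each load() consumes exactly one pickle and leaves the file
# position at the start of the next one.
lengths = []
with open(path, "rb") as f:
    for _ in range(3):
        lengths.append(len(pickle.load(f)))

print(lengths)  # [3, 2, 3]
```

This mirrors the 335 / 59 / 12 output in the question: three different objects, one per call.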
Background:
Hi. I'm currently working on a project that mainly relies on Pickles to save the state of objects I have. Here is a code snippet of two functions I've written:
from Kiosk import * #The main class used for the lockers
import gc #Garbage Collector library. Used to get all instances of a class
import pickle #Library used to store variables in files.
def storeData(Lockers:Locker):
    with open('lockerData', 'wb') as File:
        pickle.dump(Lockers, File)

def readData():
    with open('lockerData', 'rb') as File:
        return pickle.load(File)
This pickle data will eventually be sent and received from a server using the Sockets library.
I've done some reading on the topic of Pickles and it seems like everyone agrees that pickles can be quite dangerous to use in some use cases as it's relatively easy to get them to execute unwanted code.
Objective:
For the above-mentioned reasons I want to encrypt my pickle data with AES before writing it to the pickle file, so that the pickle file is always encrypted, even when it is sent to and received from the server. My main problem now is that I don't know how to get the pickle data without writing it to the pickle file first. pickle.dump() only allows me to write the pickle data to a file but doesn't give me the pickle data directly.
If I decide to do encryption after the pickle data has already been written to the file that would mean that there would be a period of time where the pickle data is stored in plain text, and I don't want that to happen.
Pseudocode:
Here is how I'm expecting the task execution to flow:
PickleData = createPickle(Lockers)
PickleDataE = encrypt(PickleData)
with open('textfile.txt', 'wb') as File:
    File.write(PickleDataE)
Question:
So my question is, how can I get the pickle data without writing it to a file?
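You can store the encrypted data itself in the file. pickle.dumps() returns the pickled bytes without touching a file, so you can encrypt them before anything is written. When you read the encrypted data back, decrypt it into a variable and pass that to pickle.loads(). Alternatively, if you wrap that variable in an io.BytesIO object (not io.StringIO, because pickle data is bytes), you can read it just like you do from a file, except it's in memory now. If you give that a try, I'm sure future questions can help with how to read the decrypted data as if it were pickle data.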
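A minimal sketch of that flow, assuming Python 3. pickle.dumps() and pickle.loads() work on bytes in memory, and io.BytesIO is the bytes counterpart of an in-memory file. The toy_encrypt and toy_decrypt helpers are hypothetical placeholders (a single-byte XOR, which is not secure) standing in for a real AES routine from a crypto library; the lockers dictionary is likewise made up.

```python
import io
import pickle


def toy_encrypt(data: bytes, key: int = 0x5A) -> bytes:
    # HYPOTHETICAL placeholder, NOT real encryption: XOR each byte with one key byte.
    # Swap in a real AES implementation from a crypto library here.
    return bytes(b ^ key for b in data)


def toy_decrypt(data: bytes, key: int = 0x5A) -> bytes:
    # XOR is its own inverse, so decryption repeats the same operation.
    return toy_encrypt(data, key)


lockers = {"locker_1": "free", "locker_2": "taken"}  # stand-in for the Locker objects

# pickle.dumps() returns the pickled bytes directly; nothing is written to disk.
pickle_data = pickle.dumps(lockers)
encrypted = toy_encrypt(pickle_data)

# ... write `encrypted` to a file or send it over a socket ...

# Reading side: decrypt in memory, then unpickle straight from the bytes.
restored = pickle.loads(toy_decrypt(encrypted))

# Equivalent file-like route: wrap the decrypted bytes in io.BytesIO.
restored_via_file = pickle.load(io.BytesIO(toy_decrypt(encrypted)))
```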
I'm running 64-bit Python 3 on Linux, and I have a code that generates lists with about 20,000 elements. A memory error occurred when my code tried to write a list of ~20,000 2D arrays to a binary file via the pickle module, but it generated all of these arrays and appended them to this list without a problem. I know this must take up a lot of memory, but the machine I'm using has about 100GB available (from the command free -m). The line with the error:
with open('all_data.data', 'wb') as f:
    pickle.dump(data, f)
>>> MemoryError
where data is my list of ~20,000 numpy arrays. Also, previously I was trying to run this code with about 55,000 elements, but while it was 40% of the way through with appending all the arrays to the data list, it just output Killed by itself. So now I'm trying to break it into segments, but this time I get a MemoryError. How can I bypass this? I was also informed that I have access to multiple CPUs, but I have no idea how to take advantage of these (I don't yet understand multiprocessing).
Pickle will try to parse all your data, and likely convert it to intermediate states before writing everything to disk - so if you are using about half your available memory, it will blow up.
Since your data is already on a list, an easy workaround there is to pickle each array, and store it, instead of trying to serialize the 20000 arrays in a single go:
with open('all_data.data', 'wb') as f:
    for item in data:
        pickle.dump(item, f)
Then, to read it back, just keep unpickling objects from your file and appending them to a list, until the file is exhausted:
data = []
with open('all_data.data', 'rb') as f:
    while True:
        try:
            data.append(pickle.load(f))
        except EOFError:
            break
This works because unpickling from a file is quite well behaved: the file pointer stays exactly at the point where a pickled object stored in the file ends, so further reads start at the beginning of the next object.
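The read loop can also be wrapped in a generator, which avoids holding every array in memory at once when the caller only needs one item at a time. This is a sketch, not part of the original answer: iter_pickles is a hypothetical helper name, and an in-memory io.BytesIO buffer stands in for 'all_data.data'. Recording tell() after each load shows the file position advancing one pickle at a time.

```python
import io
import pickle


def iter_pickles(f):
    """Yield every pickled object stored back to back in the binary file f."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            # Raised when load() is called with nothing left to read.
            return


# In-memory stand-in for 'all_data.data'.
buf = io.BytesIO()
for item in ([1, 2], [3, 4, 5], [6]):
    pickle.dump(item, buf)

buf.seek(0)
positions = [buf.tell()]
items = []
for obj in iter_pickles(buf):
    items.append(obj)
    positions.append(buf.tell())  # offset where the next pickle begins
```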
I have my pickle function working properly
with open(self._prepared_data_location_scalar, 'wb') as output:
    # company1 = Company('banana', 40)
    pickle.dump(X_scaler, output, pickle.HIGHEST_PROTOCOL)
    pickle.dump(Y_scaler, output, pickle.HIGHEST_PROTOCOL)
with open(self._prepared_data_location_scalar, 'rb') as input_f:
    X_scaler = pickle.load(input_f)
    Y_scaler = pickle.load(input_f)
However, I am very curious: how does pickle know which one to load? Does it mean that everything has to be loaded in the same sequence it was dumped in?
What you have is fine. It's a documented feature of pickle:
It is possible to make multiple calls to the dump() method of the same Pickler instance. These must then be matched to the same number of calls to the load() method of the corresponding Unpickler instance.
There is no magic here, pickle is a really simple stack-based language that serializes python objects into bytestrings. The pickle format knows about object boundaries: by design, pickle.dumps('x') + pickle.dumps('y') is not the same bytestring as pickle.dumps('xy').
If you're interested to learn some background on the implementation, this article is an easy read to shed some light on the python pickler.
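The point about object boundaries can be checked directly. A small sketch, assuming Python 3: every pickle ends with the STOP opcode (the byte b'.'), which is how a load call knows where one object ends and the next begins.

```python
import io
import pickle

x = pickle.dumps('x')
y = pickle.dumps('y')
xy = pickle.dumps('xy')

# Every pickle is a self-contained program that ends with the STOP opcode, b'.'.
assert x.endswith(b'.') and y.endswith(b'.')

# Concatenating two pickles therefore gives two programs, not one longer string.
assert x + y != xy

stream = io.BytesIO(x + y)
first = pickle.load(stream)   # stops at the first STOP opcode
second = pickle.load(stream)  # continues from where the first load ended
```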
Wow, I did not even know you could do this... and I have been using Python for a very long time, so that's totally awesome in my book. However, you really should not do this: it will be very hard to work with later (especially if it isn't you working on it).
I would recommend just doing
pickle.dump({"X": X_scaler, "Y": Y_scaler}, output)
...
data = pickle.load(fp)
print("Y_scaler:", data['Y'])
print("X_scaler:", data['X'])
unless you have a very compelling reason to save and load the data like you were in your question ...
edit to answer the actual question...
It loads from the start of the file to the end (i.e., it loads them in the same order they were dumped).
Yes, pickle loads objects in the order they were saved.
Intuitively, pickle appends to the end of the file when it writes (dump),
and reads (load) the content of the file sequentially.
Consequently, order is preserved, allowing you to retrieve your data in the exact order you serialized it.
I just tried to update a program I wrote and I needed to add another pickle file. So I created the blank .pkl and then used this command to open it (just as I did with all my others):
with open('tryagain.pkl', 'r') as input:
    self.open_multi_clock = pickle.load(input)
Only this time around I keep getting this really weird error for no obvious reason:
cPickle.UnpicklingError: invalid load key, 'Γ'.
The pickle file does contain the necessary information to be loaded; it is an exact match to other blank .pkl's that I have, and they load fine. I don't know what that last key in the error is, but I suspect it could give me some insight if I knew what it means.
So I have figured out the solution to this problem, and I thought I'd take the time to list some examples of what to do and what not to do when using pickle files. Firstly, the solution to this was simply to make a plain old .txt file and dump the pickle data to it.
If you are under the impression that you have to actually make a new file and save it with a .pkl ending, you would be wrong. I was creating my .pkl's with Notepad++ and saving them as .pkl's. Now, from my experience this does work sometimes and sometimes it doesn't; if you're semi-new to programming this may cause a fair amount of confusion, as it did for me. All that being said, I recommend just using plain old .txt files. It's the information stored inside the file, not necessarily the extension, that is important here.
#Notice file hasn't been pickled.
#What not to do. No need to name the file .pkl yourself.
with open('tryagain.pkl', 'r') as input:
    self.open_multi_clock = pickle.load(input)
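The proper way:
#Pickle your new file. Binary mode ('wb') is needed for pickle data.
with open('tryagain.txt', 'wb') as output:
    pickle.dump(obj, output, -1)
#Now open with the original .txt ext. DONT RENAME.
#Read in binary mode ('rb') to match the 'wb' used when dumping.
with open('tryagain.txt', 'rb') as input:
    self.open_multi_clock = pickle.load(input)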
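The two cases above can be reproduced in a short sketch. It is written for Python 3 (the post uses Python 2's cPickle, where the exact error differs), and the scratch directory and the {"clocks": 3} payload are made up for illustration. It suggests that what matters is whether the file was written by pickle.dump and opened in binary mode, not which extension it carries.

```python
import os
import pickle
import tempfile

workdir = tempfile.mkdtemp()  # scratch directory, illustrative only

# A file created by hand (empty here) contains no pickle data,
# whatever its extension says.
empty_path = os.path.join(workdir, "tryagain.pkl")
open(empty_path, "wb").close()

try:
    with open(empty_path, "rb") as f:
        pickle.load(f)
    empty_result = "loaded"
except EOFError:
    empty_result = "EOFError"

# A file written by pickle.dump loads fine under any extension,
# provided both sides use binary mode.
txt_path = os.path.join(workdir, "tryagain.txt")
with open(txt_path, "wb") as f:
    pickle.dump({"clocks": 3}, f, pickle.HIGHEST_PROTOCOL)

with open(txt_path, "rb") as f:
    loaded = pickle.load(f)
```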
Gonna guess the pickled data is throwing off portability because of the characters it outputs. I'd suggest base64-encoding the pickled data before writing it to file. Here is what I ran (Python 3):
import base64
import pickle

value_p = pickle.dumps("abdfg")
value_p_b64 = base64.b64encode(value_p)

# b64encode returns bytes, so write and read in binary mode.
with open("output.pkl", "wb") as f:
    f.write(value_p_b64)

readable = ""
with open("output.pkl", "rb") as f:
    for line in f:
        readable += pickle.loads(base64.b64decode(line))

>>> readable
'abdfg'
I have a problem writing a file with Pickle in Python
Here is my code:
test = "TEST"
f1 = open(path+filename, "wb", 0)
pickle.dump(test,f1,0)
f1.close()
return
This gives me the output in the .txt file as VTESTp0. I'm not sure why this is?
Shouldn't it just have been saved as TEST?
I'm very new to pickle and I didn't even know it existed until today so sorry if I'm asking a silly question.
No, pickle does not write strings just as strings. Pickle is a serialization protocol, it turns objects into strings of bytes so that you can later recreate them. The actual format depends on which version of the protocol you use, but you should really treat pickle data as an opaque type.
If you want to write the string "TEST" to the file, just write the string itself. Don't bother with pickle.
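To see where the extra characters come from, the pickled bytes can be inspected in memory. A small sketch, assuming Python 3 and protocol 0 as in the question: the output begins with an opcode byte followed by the text, and ends with the STOP opcode.

```python
import pickle

data = pickle.dumps("TEST", protocol=0)
print(data)  # begins with b'VTEST': an opcode byte, then the text

# The extra bytes are instructions for the unpickler, not part of the string;
# loading the data gives the original value back.
restored = pickle.loads(data)
```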
Think of pickling as saving binary data to disk. This is interesting if you have data structures in your program like a big dict or array, which took some time to create. You can save them to a file with pickle and read them in with pickle the next time your program runs, thus saving you the time it took to build the data structure. The downside is that other, non-Python programs will not be able to understand the pickle files.
As pickle is quite versatile you can of course also write simple text strings to a pickle file. But if you want to process them further, e.g. in a text editor or by another program, you need to store them verbatim, as Thomas Wouters suggests:
test = "TEST"
f1 = open(path+filename, "w")  # text mode, since test is a str
f1.write(test)
f1.close()
return