how does pickle know which to pick? - python

I have my pickle function working properly:
with open(self._prepared_data_location_scalar, 'wb') as output:
    # company1 = Company('banana', 40)
    pickle.dump(X_scaler, output, pickle.HIGHEST_PROTOCOL)
    pickle.dump(Y_scaler, output, pickle.HIGHEST_PROTOCOL)

with open(self._prepared_data_location_scalar, 'rb') as input_f:
    X_scaler = pickle.load(input_f)
    Y_scaler = pickle.load(input_f)
However, I am very curious: how does pickle know which object to load? Does it mean that everything has to be loaded in the same sequence it was dumped?

What you have is fine. It's a documented feature of pickle:
It is possible to make multiple calls to the dump() method of the same Pickler instance. These must then be matched to the same number of calls to the load() method of the corresponding Unpickler instance.
There is no magic here, pickle is a really simple stack-based language that serializes python objects into bytestrings. The pickle format knows about object boundaries: by design, pickle.dumps('x') + pickle.dumps('y') is not the same bytestring as pickle.dumps('xy').
If you're interested in learning some background on the implementation, this article is an easy read that sheds some light on the Python pickler.
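For illustration, here is a minimal sketch (using an in-memory buffer instead of a file) showing that each dump stays a separate object with its own boundary:
import io
import pickle

buf = io.BytesIO()
# Two separate dump calls write two separate pickles back to back.
pickle.dump('x', buf, pickle.HIGHEST_PROTOCOL)
pickle.dump('y', buf, pickle.HIGHEST_PROTOCOL)

buf.seek(0)
# Each load call consumes exactly one pickle and stops at its boundary.
print(pickle.load(buf))  # x
print(pickle.load(buf))  # y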

Wow, I did not even know you could do this... and I have been using Python for a very long time... so that's totally awesome in my book. However, you really should not do this: it will be very hard to work with later (especially if it isn't you working on it).
I would recommend just doing
pickle.dump({"X": X_scaler, "Y": Y_scaler}, output)
...
data = pickle.load(fp)
print("Y_scaler:", data['Y'])
print("X_scaler:", data['X'])
unless you have a very compelling reason to save and load the data the way you were in your question ...
edit to answer the actual question...
it loads from the start of the file to the end (i.e. it loads them in the same order they were dumped)
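For reference, here is a fuller, runnable sketch of that dictionary approach (the file name and the stand-in scaler values are made up for illustration):
import pickle

X_scaler, Y_scaler = "x-scaler", "y-scaler"  # stand-ins for the real scaler objects

# Dump both objects under named keys in a single pickle.
with open("scalers.pkl", "wb") as output:
    pickle.dump({"X": X_scaler, "Y": Y_scaler}, output, pickle.HIGHEST_PROTOCOL)

# Load them back; order no longer matters, you look things up by name.
with open("scalers.pkl", "rb") as fp:
    data = pickle.load(fp)
print("Y_scaler:", data["Y"])
print("X_scaler:", data["X"])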

Yes, pickle loads objects in the order they were saved.
Intuitively, pickle appends to the end of the file when it writes (dump),
and reads (load) the contents sequentially from the file.
Consequently, order is preserved, allowing you to retrieve your data in the exact order you serialized it.

Related

Confusion about with statement python

I have a question regarding the use of with statements in Python, as given below:
with open(fname) as f:
    np.save(f, MyData)
If I'm not mistaken, this opens the file fname in a safe manner, such that if an exception occurs the file is closed properly. Then it writes MyData to the file. But what I would do is simply:
np.save(fname,MyData)
This would result in the same thing: MyData gets written to fname. I'm not sure I understand correctly why the former is better. I don't understand how this one-liner could keep the file "open" after it has run the line. Therefore I also don't see how this could create issues when my code crashes afterwards.
Maybe this is a stupid/basic question, but I always thought that cleaner code is nicer code, so not having the extra with block just seems better to me.
numpy.save() handles the opening and closing in its own code when you give it a path. However, if you supply an open file object, it'll leave it open, because it assumes you want to do something else with the file; if it closed the file, it would break that for you.
Try this:
f = open(<file>)
f.close()
f.read() # boom
See also the hasattr(file, "write") check in NumPy's source ("file" here being a file object, buffer, or other IO handle): NumPy simply checks whether the object has a write() method and, if so, assumes it is a file-like object.
However, NumPy doesn't guard against misuse of its API: e.g. if you pass a custom buffer-like object that doesn't implement write(), it will be treated as a path and the open() call will fail.
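A small sketch of the difference (assuming NumPy behaves as described above; the array and file names are made up):
import numpy as np

arr = np.arange(10)

# Passing a path: np.save opens and closes the file itself.
np.save("mydata.npy", arr)

# Passing an open file object: np.save writes to it but leaves it open,
# so closing it is your job, which is exactly what `with` guarantees.
with open("mydata.npy", "wb") as f:
    np.save(f, arr)
# f is closed here, even if np.save had raised an exception.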

Difference between pickle and opening a file?

What is the difference between using the pickle library and using with open()?
Both have the same functionality, where you read and write to a file, and I don't see any difference between them.
And why do many people use pickle more than with open() if they are seemingly so similar?
Let me see if I can understand where the point of confusion is and give a useful explanation.
open is how you get a file object, which is the interface between your Python program and an actual file on disk. with is a tool used for ensuring that the file object is closed at the appropriate time.
The file object allows you to read and/or write the file, depending on how it was opened. The built-in way to do this is with the object's own functionality. This lets you write whatever data you want, at the expense that you are responsible for figuring out what that data should be; alternately, it lets you read the data and gives you both the power and responsibility that comes from interpreting that data.
The pickle library builds on top of that functionality, to use the file's contents to represent native Python objects. It does the interpretation (parsing) and data-figuring-out (formatting) work for you, accomplishing something that would be difficult by hand. The trade-off is that it works in a specific way, and is fit for only a specific purpose - you won't, for example, be producing or interpreting plain text files, or images, or JSON data, etc. this way any time soon (which you could by writing the data yourself, or by using a different, special-purpose library - except of course for plain text, where there's no point in doing anything beyond using the built-in functionality).
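To make the role of with concrete, a with block is roughly equivalent to a try/finally that closes the file for you (a sketch, not the exact implementation):
# Using with: the file object is closed when the block exits, even on error.
with open("example.txt", "w") as f:
    f.write("hello")

# Roughly equivalent manual version:
f = open("example.txt", "w")
try:
    f.write("hello")
finally:
    f.close()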
The difference is in what you put in the file, and who is responsible for the underlying file's format / serialization.
With the builtin open, you receive a raw file handle. You can write whatever you want to it. It doesn't have to be structured, it doesn't have to be consistent, hell it doesn't even have to make sense to an outside observer. You are allowed to write whatever you want to a file.
With a pickle, the underlying module is responsible for what is written. It serializes Python objects (as much as possible; there are examples of classes that cannot be pickled) in a consistent, reproducible format that can then be re-loaded, i.e. you can save the state of actual Python objects in a static file, and then reload them and end up with identical objects the next time the interpreter runs. This has advantages when dealing with stateful programs.
Bonus: The shelve module serves as a user-friendly frontend to pickle that behaves like a dictionary. When you close the shelf, the contents are serialized to disk. When you re-open the shelf, the objects are deserialized from the file and accessible the same way a dictionary would be.
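A small sketch of that shelve usage (the file name is arbitrary):
import shelve

# Writing: the shelf behaves like a dict whose values are pickled to disk.
with shelve.open("mydata") as db:
    db["scaler"] = {"mean": 0.5, "std": 0.1}

# Reading: values are unpickled on access.
with shelve.open("mydata") as db:
    print(db["scaler"]["mean"])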
pickle allows you to conveniently write Python objects to a file, and load those objects back. How would you use open() to write a dictionary into a file, and be able to load it into your Python program with one simple line?
For open(), it will be like:
dct = {'a': 1,
       'b': 2,
       'c': 3,
       'd': 4,
       'e': 5}

with open('file.txt', 'w') as f:
    f.write('\n'.join([f"{k}, {v}" for k, v in dct.items()]))

with open('file.txt', 'r') as f:
    dct = {k: int(v) for k, v in [s.split(', ') for s in f.read().splitlines()]}
While with pickle:
import pickle

dct = {'a': 1,
       'b': 2,
       'c': 3,
       'd': 4,
       'e': 5}

with open('file.txt', 'wb') as f:
    pickle.dump(dct, f)

with open('file.txt', 'rb') as f:
    dct = pickle.load(f)
Note the conversion done in the first method, where we need to convert each value from a string back into an integer. With pickle, you won't have to worry about that.
pickle is a standard-library module that converts Python objects into a byte stream (and back) for you, saving developer effort; you still use open() to get the file object it writes to.
Further info: https://docs.python.org/3/library/pickle.html
and here is the code: https://github.com/python/cpython/blob/main/Lib/pickle.py

Writing a dictionary to a file and reading it back - Most efficient method [duplicate]

This question already has answers here: Why is dumping with `pickle` much faster than `json`? (3 answers)
Closed 3 years ago.
I wish to write to a text file with a dictionary. There are three methods that I've seen and it seems that they are all valid, but I am interested in which one will be most optimized or efficient for reading/writing, especially when I have a large dictionary with many entries and why.
new_dict = {}
new_dict["city"] = "Boston"

# Writing to the file by string conversion
with open(r'C:\Users\xy243\Documents\pop.txt', 'w') as new_file:
    new_file.write(str(new_dict))

# Writing to the file using pickle (note: pickle needs a binary-mode file)
import pickle
with open(r'C:\Users\xy243\Documents\pop.txt', 'wb') as new_file:
    pickle.dump(new_dict, new_file, protocol=pickle.HIGHEST_PROTOCOL)

# Writing to the file using JSON
import json
with open(r'C:\Users\xy243\Documents\pop.txt', 'w') as new_file:
    json.dump(new_dict, new_file)
The answers about efficiency have pretty much been covered in the comments. However, if your dataset is large and you might want to replicate your approach, it would probably be useful to consider SQL alternatives, made easier in Python with SQLAlchemy. That way, you can access the data quickly, but store it neatly in a database.
Objects of some Python classes may not be JSON serializable. If your dictionary contains such objects (as values), then you can't use json.
Likewise, objects of some Python classes may not be picklable (for example, keras/tensorflow objects). Then, again, you can't use pickle.
In my opinion, there are more classes that can't be JSON-serialized than classes that can't be pickled.
That being said, pickle may be applicable more widely than json.
Efficiency-wise (assuming your dictionary is both JSON serializable and picklable), pickle will generally win, because no string conversion is involved (number to string while serializing and string to number while deserializing).
If you are trying to transport the object to another process/server (especially one written in another programming language ... Java etc.), then you have to live with json. This applies even if you write to a file and another process reads from that file.
So ... it depends on your use-case.
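If you want to check the speed claim on your own data, a rough timing sketch along these lines would do it (the dataset here is made up and the numbers will vary by machine):
import json
import pickle
import timeit

data = {str(i): i for i in range(100_000)}

pickle_time = timeit.timeit(
    lambda: pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL), number=20)
json_time = timeit.timeit(lambda: json.dumps(data), number=20)

print(f"pickle: {pickle_time:.3f}s  json: {json_time:.3f}s")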

What is pickle doing?

I have used Python for years. I have used pickle extensively. I cannot figure out what this is doing:
with codecs.open("huge_picklefile.pc", "rb") as f:
    data = pickle.load(f)
    print(len(data))
    data = pickle.load(f)
    print(len(data))
    data = pickle.load(f)
    print(len(data))
This returns to me:
335
59
12
I am beyond confused. I am used to pickle loading the massive file into memory. The object itself is a massive array of arrays (I assume). Could it be composed of multiple pickle objects? Unfortunately, I didn't create the pickle object and I don't have access to whoever did.
I cannot figure out why pickle is splitting up my file into chunks, which isn't the default, and I am not telling it to. What does reloading the same file do? I honestly never tried or even came across a use case until now.
I spent a good 5 hours trying to figure out how to even ask this question on Google. Unsurprisingly, trying "multiple pickle loads on the same document" doesn't yield anything too useful. The Python 3.7 pickle docs do not describe this behavior. I can't figure out how repeatedly loading from a pickle file doesn't (a) crash or (b) load the entire thing into memory and then just reference itself. In my 15 years of using Python I have never run into this problem... so I am taking a leap of faith that this is just weird and we should probably just use a database instead.
This file is not quite a single pickle. Someone has dumped multiple pickles into the same file, so the file contents are a concatenation of multiple pickles. When you call pickle.load(f), pickle reads the file from the current position until it reaches the end of that pickle's data, so each pickle.load call loads the next pickle in the file.
You can create such a file yourself by calling pickle.dump repeatedly:
with open('demofile', 'wb') as f:
    pickle.dump([1, 2, 3], f)
    pickle.dump([10, 20], f)
    pickle.dump([0, 0, 0], f)
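If you don't know in advance how many pickles the file holds, a common pattern is to keep calling pickle.load until it raises EOFError (a sketch, reusing the demo file from above):
import pickle

objects = []
with open('demofile', 'rb') as f:
    while True:
        try:
            objects.append(pickle.load(f))
        except EOFError:
            break

print(objects)  # [[1, 2, 3], [10, 20], [0, 0, 0]]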

Pickle problem writing to file

I have a problem writing a file with Pickle in Python
Here is my code:
test = "TEST"
f1 = open(path+filename, "wb", 0)
pickle.dump(test,f1,0)
f1.close()
return
This gives me the output in the .txt file as VTESTp0. I'm not sure why this is.
Shouldn't it just have been saved as TEST?
I'm very new to pickle and I didn't even know it existed until today so sorry if I'm asking a silly question.
No, pickle does not write strings just as strings. Pickle is a serialization protocol, it turns objects into strings of bytes so that you can later recreate them. The actual format depends on which version of the protocol you use, but you should really treat pickle data as an opaque type.
If you want to write the string "TEST" to the file, just write the string itself. Don't bother with pickle.
Think of pickling as saving binary data to disk. This is interesting if you have data structures in your program like a big dict or array, which took some time to create. You can save them to a file with pickle and read them in with pickle the next time your program runs, thus saving you the time it took to build the data structure. The downside is that other, non-Python programs will not be able to understand the pickle files.
As pickle is quite versatile you can of course also write simple text strings to a pickle file. But if you want to process them further, e.g. in a text editor or by another program, you need to store them verbatim, as Thomas Wouters suggests:
test = "TEST"
f1 = open(path+filename, "wb", 0)
f1.write(test)
f1.close()
return
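To see where the VTESTp0 output came from, you can compare the pickled bytes with the raw string yourself (a sketch; treat the exact bytes as an implementation detail of protocol 0):
import pickle

print(pickle.dumps("TEST", protocol=0))  # something like b'VTEST\np0\n.'
print("TEST".encode())                   # b'TEST' -- what a plain write() stores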
