I'm trying to work with the CIFAR-10 dataset, which is available in a special version for Python.
It is a set of binary files, each representing a dictionary of 10k numpy matrices. The files were obviously created by python2 cPickle.
I tried to load it from python2 as follows:
import cPickle
with open("data/data_batch_1", "rb") as f:
    data = cPickle.load(f)
This works fine. However, if I try to load the data from Python 3 (which has pickle instead of cPickle), it fails:
import pickle
with open("data/data_batch_1", "rb") as f:
    data = pickle.load(f)
It fails with the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 6: ordinal not in range(128)
Can I somehow transform the original dataset into a new one that is readable from Python 3? Or can I read it from Python 3 directly?
I've tried loading it with cPickle, dumping it to JSON and reading it back, but numpy matrices obviously can't be written to a JSON file.
You'll need to tell pickle what codec to use for those bytestrings, or tell it to load the data as bytes instead. From the pickle.load() documentation:
The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.
To load the strings as bytes objects that'd be:
import pickle
with open("data/data_batch_1", "rb") as f:
    data = pickle.load(f, encoding='bytes')
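One consequence of encoding='bytes' that is easy to miss: the dictionary keys come back as bytes too. The sketch below illustrates this with a small hand-assembled protocol-2 stream (the name PY2_PICKLE and its contents are made up for illustration, standing in for a real file written by Python 2), not with the actual CIFAR-10 batch.

```python
import pickle

# Hand-assembled protocol-2 stream standing in for a file written by Python 2.
# It encodes {'data': '\x8b\x00\xff'}, where key and value are Python 2 str
# objects (opcode SHORT_BINSTRING: "U" + length byte + raw bytes).
PY2_PICKLE = b"\x80\x02}U\x04dataU\x03\x8b\x00\xffs."

# With the defaults, 8-bit strings are decoded as ASCII, so the value fails.
try:
    pickle.loads(PY2_PICKLE)
    default_ok = True
except UnicodeDecodeError:
    default_ok = False

# encoding='bytes' leaves every Python 2 str untouched, keys included.
batch = pickle.loads(PY2_PICKLE, encoding="bytes")
raw = batch[b"data"]  # note the bytes key: batch["data"] would raise KeyError
```

If the real file follows the same pattern, its entries would likewise have to be looked up with bytes keys.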
Consider the following snippet:
import io
result = io.StringIO()
with open("file.pdf") as fp:
    extract_text_to_fp(fp, result, output_type='xml')
data = result.getvalue()
This results in the following error
ValueError: Codec is required for a binary I/O output
If I leave out output_type, I get the error
`UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3804: character maps to <undefined>` instead.
I don't understand why this happens, and would like help with a workaround.
I figured out how to fix the problem:
First you need to open "file.pdf" in binary mode. Then, if you want to read to memory, use BytesIO instead of StringIO and decode that.
For example
import io
from pdfminer.high_level import extract_text_to_fp  # assuming pdfminer.six

result = io.BytesIO()
with open("file.pdf", 'rb') as fp:
    extract_text_to_fp(fp, result, output_type='xml')
data = result.getvalue().decode("utf-8")
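The general pattern behind this fix can be sketched with the standard library alone. Here write_report is a hypothetical stand-in for any library call that writes already-encoded bytes to its output; it does not reproduce the exact error message from the question, only the mismatch between a text buffer and a binary writer.

```python
import io


def write_report(out):
    # Hypothetical stand-in for a library call that writes encoded bytes.
    out.write("<page>caf\u00e9</page>".encode("utf-8"))


# A text buffer rejects bytes outright.
text_buffer = io.StringIO()
try:
    write_report(text_buffer)
    text_ok = True
except TypeError:
    text_ok = False

# A binary buffer accepts them; decode once at the end to get a str.
binary_buffer = io.BytesIO()
write_report(binary_buffer)
data = binary_buffer.getvalue().decode("utf-8")
```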
I have a pickle file which contains floating-point values. This file was created with Python 2.7. In Python 2.7 I used to load it like:
matrix_file = pickle.load(open('matrix.pickle', 'r'))
In Python 3.8 this code raises an error:
TypeError: a bytes-like object is required, not 'str'
When I tried with 'rb' I got this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 8: ordinal not in range(128)
So I tried another method
matrix_file = pickle.load(open('matrix.pickle', 'r', encoding='utf-8'))
Now I get a different error
TypeError: a bytes-like object is required, not 'str'
Update: When I try loading with joblib, I get this error
ValueError: You may be trying to read with python 3 a joblib pickle generated with python 2. This feature is not supported by joblib.
The file must be opened in binary mode and you need to provide an encoding for the pickle.load call. Typically, the encoding should either be "latin-1" (for pickles with numpy arrays, datetime, date and time objects, or when the strings were logically Latin-1), or "bytes" (to decode Python 2 str as bytes objects). So the code should be something like:
with open('matrix.pickle', 'rb') as f:
    matrix_file = pickle.load(f, encoding='latin-1')
This assumes it was originally containing numpy arrays; if not, "bytes" might be the more appropriate encoding. I also used a with statement just for good form (and to ensure deterministic file closing on non-CPython interpreters).
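To see how the two encodings differ, here is a minimal sketch. The helper name load_py2_pickle and the hand-assembled stream PY2_PICKLE are illustrative assumptions, not part of the original file; the stream stands in for a pickle written by Python 2 that contains a non-ASCII str.

```python
import os
import pickle
import tempfile


def load_py2_pickle(path, encoding="latin-1"):
    """Open in binary mode and pass an explicit encoding to pickle.load."""
    with open(path, "rb") as f:
        return pickle.load(f, encoding=encoding)


# Hand-assembled protocol-2 stream standing in for a Python 2 pickle.
# It encodes {'name': 'caf\xe9'} with Python 2 str key and value.
PY2_PICKLE = b"\x80\x02}U\x04nameU\x04caf\xe9s."

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix.pickle")
    with open(path, "wb") as f:
        f.write(PY2_PICKLE)

    as_text = load_py2_pickle(path)                     # str keys and values
    as_bytes = load_py2_pickle(path, encoding="bytes")  # bytes keys and values
```

With "latin-1" the result is ready to use as text; with "bytes" nothing is decoded, so the caller decides later which codec applies.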
I pickled a model in Python 2.7 with the following statement:
import pickle
with open('filename','w') as f:
    pickle.dump(model, f)
Now I am in Python 3.x and want to unpickle the model, but I get this error:
'utf-8' codec can't decode byte 0x86 in position 4: invalid start byte
The code I tried is:
import pickle
with open('filename','rb') as f:
    model = pickle.load(f, encoding='UTF-8')
You pickle with 'w' but unpickle with 'rb', so that may be the problem.
The other thing I found: 0x86 can be decoded using latin-1, so you could try changing the encoding, the file mode, or both.
I also read in the pickle docs that the pickle protocol is detected automatically and should not cause the problem, so it seems to be all about encoding.
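The point about 0x86 can be checked directly. This is a small sketch using only built-in codecs; the variable names are illustrative.

```python
raw = bytes([0x86])

# 0x86 is a UTF-8 continuation byte, so it cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte value to the code point with the same number,
# so decoding never fails (whether the result is meaningful is another matter).
latin1_text = raw.decode("latin-1")
all_code_points = [ord(ch) for ch in bytes(range(256)).decode("latin-1")]
```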
I'm wondering if there is a way to load an object that was pickled in Python 2.4, with Python 3.4.
I've been running 2to3 on a large amount of company legacy code to get it up to date.
Having done this, when running the file I get the following error:
  File "H:\fixers - 3.4\addressfixer - 3.4\trunk\lib\address\address_generic.py", line 382, in read_ref_files
    d = pickle.load(open(mshelffile, 'rb'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
Looking at the pickled object in contention, it's a dict in a dict, containing keys and values of type str.
So my question is: Is there a way to load an object, originally pickled in python 2.4, with python 3.4?
You'll have to tell pickle.load() how to convert Python 2 bytestring data to Python 3 strings, or you can tell pickle to leave them as bytes.
The default is to try and decode all string data as ASCII, and that decoding fails. See the pickle.load() documentation:
Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2. If fix_imports is true, pickle will try to map the old Python 2 names to the new names used in Python 3. The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.
Setting the encoding to latin1 allows you to import the data directly:
with open(mshelffile, 'rb') as f:
    d = pickle.load(f, encoding='latin1')
but you'll need to verify that none of your strings are decoded using the wrong codec; Latin-1 works for any input as it maps the byte values 0-255 to the first 256 Unicode codepoints directly.
The alternative would be to load the data with encoding='bytes', and decode all bytes keys and values afterwards.
Note that in Python versions before 3.6.8, 3.7.2 and 3.8.0, unpickling of Python 2 datetime object data is broken unless you use encoding='bytes'.
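For the encoding='bytes' route, the follow-up decoding step could look like the sketch below. The function name decode_bytes and the choice of UTF-8 are assumptions made for illustration; the right codec depends on how the strings were originally produced, and this version decodes every bytes object it finds, so it is unsuitable for data that mixes text with genuine binary payloads.

```python
def decode_bytes(obj, encoding="utf-8"):
    """Recursively decode bytes keys and values in nested containers."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return {
            decode_bytes(key, encoding): decode_bytes(value, encoding)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [decode_bytes(item, encoding) for item in obj]
    if isinstance(obj, tuple):
        return tuple(decode_bytes(item, encoding) for item in obj)
    return obj  # numbers, None, str and anything else pass through unchanged


# Shape similar to the question: a dict in a dict with string keys and values,
# as it would look after loading with encoding='bytes'.
loaded = {b"street": {b"name": b"Caf\xc3\xa9 Lane", b"number": 12}}
decoded = decode_bytes(loaded)
```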
Using encoding='latin1' causes some issues when your object contains numpy arrays in it.
Using encoding='bytes' will be better.
Please see this answer for a complete explanation of using encoding='bytes'.
I know it is possible to encode a Python object to a file using
import pickle
pickle.dump(obj, file)
or you can do nearly the same using JSON, but the problem is that these all encode to or decode from a file. Is it possible to encode an object into a string or bytes variable instead of a file?
I am running Python 3.2 on Windows.
Sure, just use pickle.dumps(obj) or json.dumps(obj).
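A short round trip shows the difference in return types; the sample object is made up for illustration.

```python
import json
import pickle

obj = {"name": "example", "values": [1, 2, 3]}

# pickle.dumps returns bytes; pickle.loads reverses it.
pickled = pickle.dumps(obj)
restored = pickle.loads(pickled)

# json.dumps returns str; json.loads reverses it.
text = json.dumps(obj)
restored_json = json.loads(text)
```

Note that JSON only covers basic types (dicts, lists, strings, numbers, booleans, None), while pickle handles most Python objects but should only be loaded from trusted sources.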