Unpickling a Python 2 object with Python 3

I'm wondering if there is a way to load an object that was pickled in Python 2.4, with Python 3.4.
I've been running 2to3 on a large amount of company legacy code to get it up to date.
Having done this, when running the file I get the following error:
File "H:\fixers - 3.4\addressfixer - 3.4\trunk\lib\address\address_generic.py"
, line 382, in read_ref_files
d = pickle.load(open(mshelffile, 'rb'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal
not in range(128)
Looking at the pickled object in contention, it's a dict in a dict, containing keys and values of type str.
So my question is: Is there a way to load an object, originally pickled in python 2.4, with python 3.4?

You'll have to tell pickle.load() how to convert Python bytestring data to Python 3 strings, or you can tell pickle to leave them as bytes.
The default is to try and decode all string data as ASCII, and that decoding fails. See the pickle.load() documentation:
Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2. If fix_imports is true, pickle will try to map the old Python 2 names to the new names used in Python 3. The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.
Setting the encoding to latin1 allows you to import the data directly:
with open(mshelffile, 'rb') as f:
    d = pickle.load(f, encoding='latin1')
but you'll need to verify that none of your strings are decoded using the wrong codec; Latin-1 works for any input as it maps the byte values 0-255 to the first 256 Unicode codepoints directly.
The alternative would be to load the data with encoding='bytes', and decode all bytes keys and values afterwards.
Note that in Python versions before 3.6.8, 3.7.2 and 3.8.0, unpickling of Python 2 datetime object data is broken unless you use encoding='bytes'.
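To make the two options concrete, here is a self-contained sketch; the pickle bytes are hard-coded from a Python 2 (protocol 2) dump of {'key': 'value'}, not taken from the question's data:

```python
import pickle

# A dict pickled by Python 2 (protocol 2): {'key': 'value'}, where both
# strings were Python 2 `str` (i.e. bytes). Hard-coded here so the demo
# is self-contained; in practice this comes from your .pkl file.
PY2_PICKLE = b'\x80\x02}q\x00U\x03keyq\x01U\x05valueq\x02s.'

# encoding='latin1' decodes every Python 2 str to a Python 3 str; it
# never fails, because Latin-1 maps all 256 byte values to codepoints.
as_text = pickle.loads(PY2_PICKLE, encoding='latin1')
print(as_text)   # {'key': 'value'}

# encoding='bytes' leaves Python 2 str data as bytes objects instead.
as_bytes = pickle.loads(PY2_PICKLE, encoding='bytes')
print(as_bytes)  # {b'key': b'value'}
```

Either call works on the same stream; the difference is only in what type the Python 2 string data comes back as.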

Using encoding='latin1' can cause problems when your object contains NumPy arrays. Using encoding='bytes' is the safer choice in that case; you then decode the resulting bytes keys and values yourself afterwards.
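If you go the encoding='bytes' route, the keys and values come back as bytes and need decoding afterwards. One possible helper (a sketch; the name decode_nested and the utf-8 default are assumptions to adapt to your data):

```python
def decode_nested(obj, codec='utf-8'):
    """Recursively decode bytes keys/values produced by encoding='bytes'.

    `codec` is an assumption -- use whatever encoding the Python 2
    strings were actually written in.
    """
    if isinstance(obj, bytes):
        return obj.decode(codec)
    if isinstance(obj, dict):
        return {decode_nested(k, codec): decode_nested(v, codec)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [decode_nested(v, codec) for v in obj]
    return obj

print(decode_nested({b'outer': {b'inner': b'value'}}))
# {'outer': {'inner': 'value'}}
```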

Related

Python3 equivalent to Python2 open when encountering UnicodeDecodeErrors [duplicate]

I had a script I was trying to port from python2 to python3.
I did read through the porting documentation, https://docs.python.org/3/howto/pyporting.html.
My original Python 2 script used open('filename.txt'). In porting to Python 3 I updated it to io.open('filename.txt'). Now, when running the script under either Python 2 or Python 3 with the same input files, I get errors like UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 793: invalid start byte.
Does python2 open have less strict error checking than io.open or does it use a different default encoding? Does python3 have an equivalent way to call io.open to match python2 built in open?
Currently I've started using f = io.open('filename.txt', mode='r', errors='replace') which works. And comparing output to the original python2 version, no important data was lost from the replace errors.
First, io.open is open; there's no need to stop using open directly.
The issue is that your call to open is assuming the file is UTF-8-encoded when it is not. You'll have to supply the correct encoding explicitly.
open('filename.txt', encoding='iso-8859-1') # for example
(Note that the default encoding is platform-specific, but your error indicates that you are, in fact, defaulting to UTF-8.)
In Python 2, no attempt was made to decode non-ASCII files; reading from a file returned a str value consisting of whatever bytes were actually stored in the file.
This is part of the overall shift in Python 3 from using the old str type as sometimes text, sometimes bytes, to using str exclusively for Unicode text and bytes for any particular encoding of the text.
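Byte 0x93 from the error is a curly left quote in Windows cp1252, so that encoding is a plausible (but assumed) candidate for the file in question. A small sketch of the options, on a throwaway file containing that byte:

```python
import os
import tempfile

# Create a file containing cp1252 curly quotes (0x93 / 0x94), the kind
# of bytes that make UTF-8 decoding fail with "invalid start byte".
path = os.path.join(tempfile.mkdtemp(), 'filename.txt')
with open(path, 'wb') as f:
    f.write(b'\x93quoted\x94')

# Supplying the correct encoding decodes the text properly.
with open(path, encoding='cp1252') as f:
    print(f.read())          # “quoted”

# Python 2-like behaviour: read raw bytes, no decoding at all.
with open(path, 'rb') as f:
    print(f.read())          # b'\x93quoted\x94'
```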

Error loading pickle file created with Python 2.7 in Python 3.8

I have a pickle file which contains floating-point values. This file was created with Python 2.7. In Python 2.7 I used to load it like:
matrix_file = pickle.load(open('matrix.pickle', 'r'))
Now in Python 3.8 this code is giving error
TypeError: a bytes-like object is required, not 'str'
When I tried with 'rb' I got this error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 8: ordinal not in range(128)
So I tried another method
matrix_file = pickle.load(open('matrix.pickle', 'r', encoding='utf-8'))
Now I get a different error
TypeError: a bytes-like object is required, not 'str'
Update: When I try loading with joblib, I get this error
ValueError: You may be trying to read with python 3 a joblib pickle generated with python 2. This feature is not supported by joblib.
The file must be opened in binary mode and you need to provide an encoding for the pickle.load call. Typically, the encoding should either be "latin-1" (for pickles with numpy arrays, datetime, date and time objects, or when the strings were logically Latin-1), or "bytes" (to decode Python 2 str as bytes objects). So the code should be something like:
with open('matrix.pickle', 'rb') as f:
    matrix_file = pickle.load(f, encoding='latin-1')
This assumes the pickle originally contained NumPy arrays; if not, "bytes" might be the more appropriate encoding. I also used a with statement for good form (and to ensure deterministic file closing on non-CPython interpreters).
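If you are unsure which encoding to pick, the standard-library pickletools module can show whether the stream contains Python 2 str data at all (SHORT_BINSTRING/BINSTRING opcodes only come from Python 2 strings). A sketch on a hand-made Python 2 pickle of {'a': 1.5}, standing in for matrix.pickle:

```python
import io
import pickletools

# Hand-made Python 2 (protocol 2) pickle of {'a': 1.5}: the key is a
# Python 2 str (SHORT_BINSTRING), the value a float (BINFLOAT).
py2_pickle = b'\x80\x02}q\x00U\x01aq\x01G?\xf8\x00\x00\x00\x00\x00\x00s.'

# Disassemble the stream; Python 2 str data shows up as *BINSTRING opcodes.
out = io.StringIO()
pickletools.dis(py2_pickle, out=out)
print(out.getvalue())
```

Seeing string opcodes in the disassembly tells you the encoding argument matters; plain floats and ints unpickle fine with either choice.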

Python 2.7 with pandas to_csv() gives UnicodeEncodeError: 'ascii' codec can't encode character u'\xc7' in position 4: ordinal not in range(128)

I am using Python 2.7, and to overcome UTF-8 issues I am using the pandas to_csv method. The issue is that I am still getting Unicode errors, which I don't get when I run the script on my local laptop with Python 3 (not an option for batch processing).
df = pd.DataFrame(stats_results)
df.to_csv('/home/mp9293q/python_scripts/stats_dim_registration_set_column_transpose.csv',
          quoting=csv.QUOTE_ALL, doublequote=True, index=False,
          index_label=False, header=False, line_terminator='\n', encoding='utf-8')
Gives error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc7' in position 4: ordinal not in range(128)
I believe you might be having one of these two problems (or maybe both):
1. As you mentioned in the comments, the file in which you are trying to save Unicode data already exists. In that case there is a good chance the destination file does not use UTF-8/16/32 as its encoding scheme; when the file was originally created, its encoding may have been something like ANSI instead. So check whether the destination file's encoding is from the UTF family or not.
2. Encode the Unicode string to UTF-8 before storing it in the file. That is, any content you are trying to save to your destination file that contains Unicode text should be encoded first.
Ex.
# A character which cannot be encoded as plain ASCII
Uni_str = u"Ç"
# Encoding the Unicode text into UTF-8 bytes
Uni_str = Uni_str.encode("utf-8")
The above code behaves differently in Python 2.x and 3.x, the reason being that 2.x uses ASCII as its default encoding scheme while 3.x uses UTF-8. Another difference between the two is how they treat a string after it has been passed through encode().
In Python 2.x
type(u"Ç".encode("utf-8"))
Outputs
<type 'str'>
In Python 3.x
type(u"Ç".encode("utf-8"))
Outputs
<class 'bytes'>
As you can see, in Python 2.x the return type of encode() is str, but in 3.x it is bytes.
So for your case, I would recommend you to encode each string value containing unicode data in your dataframe using encode() before trying to store it in the file.
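In Python 3 the same encode() call returns bytes, which you can write to a file opened in binary mode to sidestep any implicit ASCII encoding. A minimal sketch (the file path here is made up for the demo):

```python
import os
import tempfile

# encode() turns the str into UTF-8 bytes: b'\xc3\x87' for "Ç"
uni_str = u"Ç"
encoded = uni_str.encode("utf-8")

# Writing the bytes in binary mode avoids any implicit ASCII encoding
path = os.path.join(tempfile.mkdtemp(), 'out.csv')
with open(path, 'wb') as f:
    f.write(encoded)

# Reading back with an explicit encoding restores the original text
with open(path, encoding='utf-8') as f:
    print(f.read())   # Ç
```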

UnicodeDecodeError when using python 2.7 code on python 3.7 with cPickle

I am trying to use cPickle on a .pkl file constructed from a "parsed" .csv file. The parsing is undertaken using a pre-constructed python toolbox, which has recently been ported to python 3 from python 2 (https://github.com/GEMScienceTools/gmpe-smtk)
The code I'm using is as follows:
from smtk.parsers.esm_flatfile_parser import ESMFlatfileParser
parser=ESMFlatfileParser.autobuild("Database10","Metadata10","C:/Python37/TestX10","C:/Python37/NorthSea_Inc_SA.csv")
import cPickle
sm_database = cPickle.load(open("C:/Python37/TestX10/metadatafile.pkl","r"))
It returns the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 44: character maps to <undefined>
From what I can gather, I need to specify the encoding of my .pkl file for cPickle to work, but I do not know what encoding the file produced by parsing the .csv file uses, so I can't currently do so.
I used Sublime Text to find that it is "hexadecimal", but that is not an accepted encoding format in Python 3.7, is it?
If anyone knows how to determine the required encoding format, or how to make hexadecimal encoding usable in Python 3.7, their help would be much appreciated.
P.s. the modules used such as "ESMFlatfileparser" are part of a pre-constructed toolbox. Considering this, is there a chance I may need to alter the encoding in some way within this module also?
The code is opening the file in text mode ('r'), but it should be binary mode ('rb').
From the documentation for pickle.load (emphasis mine):
[The] file can be an on-disk file opened for binary reading, an io.BytesIO object, or any other custom object that meets this interface.
Since the file is being opened in binary mode there is no need to provide an encoding argument to open. It may be necessary to provide an encoding argument to pickle.load. From the same documentation:
Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2. If fix_imports is true, pickle will try to map the old Python 2 names to the new names used in Python 3. The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects. Using encoding='latin1' is required for unpickling NumPy arrays and instances of datetime, date and time pickled by Python 2.
This ought to prevent the UnicodeDecodeError:
sm_database = cPickle.load(open("C:/Python37/TestX10/metadatafile.pkl","rb"))
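Note also that cPickle no longer exists as a separate module in Python 3; its C implementation sits behind the plain pickle module. A runnable sketch of the whole fix, using a temporary stand-in for metadatafile.pkl (a hand-made Python 2 pickle of {'station': 'X10'} is an assumption for the demo):

```python
import os
import pickle  # in Python 3, cPickle's C code lives behind plain pickle
import tempfile

# Stand-in for metadatafile.pkl: a hand-made Python 2 (protocol 2)
# pickle of {'station': 'X10'}, i.e. it contains Python 2 str data.
path = os.path.join(tempfile.mkdtemp(), 'metadatafile.pkl')
with open(path, 'wb') as f:
    f.write(b'\x80\x02}q\x00U\x07stationq\x01U\x03X10q\x02s.')

# 'rb' is the actual fix; encoding='latin1' then handles any Python 2
# str data in the stream (an assumption -- 'bytes' is the alternative).
with open(path, 'rb') as f:
    sm_database = pickle.load(f, encoding='latin1')
print(sm_database)   # {'station': 'X10'}
```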

Encode with JSON or pickle to a Variable Using Python

I know it is possible to encode a Python object to a file using
import pickle
pickle.dump(obj, file)
or you can do nearly the same using JSON, but the problem is that these all encode to or decode from a file. Is it possible to encode an object into a string or bytes variable instead of a file?
I am running Python 3.2 on Windows.
Sure, just use pickle.dumps(obj) or json.dumps(obj).
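For example (obj here is just an illustrative value):

```python
import json
import pickle

obj = {'answer': 42, 'items': [1, 2, 3]}

blob = pickle.dumps(obj)   # a bytes object, no file involved
text = json.dumps(obj)     # a str object

# Both round-trip back to an equal value with the matching loads()
restored_p = pickle.loads(blob)
restored_j = json.loads(text)
print(restored_p == obj, restored_j == obj)
```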
