I had a script I was trying to port from python2 to python3.
I did read through the porting documentation, https://docs.python.org/3/howto/pyporting.html.
My original python2 script used open('filename.txt'). In porting to python3 I updated it to io.open('filename.txt'). Now, when running the script as python2 or python3 with the same input files, I get errors like UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 793: invalid start byte.
Does the python2 built-in open have less strict error checking than io.open, or does it use a different default encoding? Does python3 have an equivalent way to call io.open to match the python2 built-in open?
Currently I've started using f = io.open('filename.txt', mode='r', errors='replace'), which works. Comparing the output to the original python2 version, no important data was lost to the replacement characters.
First, io.open is open; there's no need to stop using open directly.
The issue is that your call to open is assuming the file is UTF-8-encoded when it is not. You'll have to supply the correct encoding explicitly.
open('filename.txt', encoding='iso-8859-1') # for example
(Note that the default encoding is platform-specific, but your error indicates that you are, in fact, defaulting to UTF-8.)
In Python 2, no attempt was made to decode non-ASCII files; reading from a file returned a str value consisting of whatever bytes were actually stored in the file.
This is part of the overall shift in Python 3 from using the old str type as sometimes text, sometimes bytes, to using str exclusively for Unicode text and bytes for any particular encoding of the text.
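As a concrete illustration: byte 0x93 happens to be the left curly quote (U+201C) in Windows-1252, a common encoding for files produced on Windows, so that is one plausible candidate. A minimal sketch of the bytes/str split, assuming cp1252 really is the file's encoding:
with open('filename.txt', 'rb') as f:   # bytes, like Python 2's built-in open returned
    raw = f.read()
text = raw.decode('cp1252')             # str (Unicode text) once an encoding is chosen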
Related
What I want to do: extract text information from a pdf file and redirect that to a txt file.
What I did:
pip install pdfminer
pdf2txt.py file.pdf > output.txt
What I got:
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 0: illegal multibyte sequence
My observation:
\u2022 is the bullet point character, •.
pdf2txt.py works well without redirection: the bullet point character is written to stdout without any error.
My question:
Why does redirection cause a Python error? As far as I know, redirection is an O.S. job: it simply copies things after the program has finished.
How can I fix this error? I cannot modify pdf2txt.py, as it's not my code.
Redirection causes an error because the default encoding used by Python does not support one of the characters you're trying to output. In your case you're trying to output the bullet character • using the GBK codec. This probably means you're using a Chinese version of Windows.
A version of Python 3.6 or later will work fine writing to the terminal window on Windows, because character encoding is bypassed completely; the console is written to directly using Unicode. It's only when redirecting the output to a file that the Unicode must be encoded to a byte stream.
You can set the environment variable PYTHONIOENCODING to change the encoding used for stdio. If you use UTF-8 it will be guaranteed to work with any Unicode character.
set PYTHONIOENCODING=utf-8
pdf2txt.py file.pdf > output.txt
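If you want to see which encoding your build falls back to, a quick check (the 'gbk' in the comment is what a Chinese-locale Windows system would typically show when stdout is redirected):
import sys
print(sys.stdout.encoding)  # e.g. 'gbk' when output is redirected on Chinese-locale Windows
On Python 3.7 and later, enabling UTF-8 mode with set PYTHONUTF8=1 has much the same effect as setting PYTHONIOENCODING.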
You have obtained Unicode characters from the raw bytes, but you still need to encode them when writing them back out. I recommend using UTF-8 encoding for txt files.
Making the encoding parameter explicit is probably what you want:
def gbk_to_utf8(source, target):
    # Re-encode a GBK text file as UTF-8, line by line.
    with open(source, "r", encoding="gbk") as src, \
         open(target, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
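A hypothetical call, with made-up file names:
gbk_to_utf8("notes_gbk.txt", "notes_utf8.txt")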
I am using Python 2.7, and to overcome UTF-8 issues I am using the pandas to_csv method. The issue is that I am still getting Unicode errors, which I don't get when I run the script on my local laptop with Python 3 (not an option for batch processing).
df = pd.DataFrame(stats_results)
df.to_csv('/home/mp9293q/python_scripts/stats_dim_registration_set_column_transpose.csv',
          quoting=csv.QUOTE_ALL, doublequote=True, index=False,
          index_label=False, header=False, line_terminator='\n', encoding='utf-8')
Gives error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc7' in position 4: ordinal not in range(128)
I believe you might be having one of these two problems (or maybe both):
1. As you mentioned in the comments, the file in which you are trying to save the Unicode data already exists. It is quite likely that the destination file does not have UTF-8/16/32 as its encoding scheme: when the file was originally created, its encoding may not have been UTF-8 but, for example, ANSI. So check whether the destination file's encoding scheme is in the UTF family or not.
2. Encode the Unicode string to UTF-8 before storing it in the file. That is, any content you are trying to save to your destination file that contains Unicode text should be encoded first.
Ex.
# A character which cannot be encoded in ASCII
Uni_str = u"Ç"
# Converting the unicode text, into UTF-8 format
Uni_str = Uni_str.encode("utf-8")
The above code works differently in python 2.x and 3.x. The reason is that 2.x uses ASCII as its default encoding scheme while 3.x uses UTF-8; another difference between the two is how they treat a string after passing it through encode().
In Python 2.x:
>>> type(u"Ç".encode("utf-8"))
<type 'str'>
In Python 3.x:
>>> type(u"Ç".encode("utf-8"))
<class 'bytes'>
As you can see, in python 2.x the return type of encode() is str, but in 3.x it is bytes.
So for your case, I would recommend encoding each string value containing Unicode data in your DataFrame using encode() before trying to store it in the file.
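A minimal sketch of that idea (Python 2 only, since it relies on the unicode type; the stats_results sample row here is made up):
import pandas as pd

stats_results = [{'name': u'\xc7elik', 'value': 1}]  # u'\xc7' is the 'Ç' from the error
df = pd.DataFrame(stats_results)
# Encode every unicode cell to a UTF-8 byte string before writing.
df = df.applymap(lambda v: v.encode('utf-8') if isinstance(v, unicode) else v)
df.to_csv('stats.csv', index=False, encoding='utf-8')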
I'm wondering if there is a way to load an object that was pickled in Python 2.4, with Python 3.4.
I've been running 2to3 on a large amount of company legacy code to get it up to date.
Having done this, when running the file I get the following error:
File "H:\fixers - 3.4\addressfixer - 3.4\trunk\lib\address\address_generic.py"
, line 382, in read_ref_files
d = pickle.load(open(mshelffile, 'rb'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal
not in range(128)
Looking at the pickled object in contention, it's a dict in a dict, containing keys and values of type str.
So my question is: Is there a way to load an object, originally pickled in python 2.4, with python 3.4?
You'll have to tell pickle.load() how to convert Python bytestring data to Python 3 strings, or you can tell pickle to leave them as bytes.
The default is to try and decode all string data as ASCII, and that decoding fails. See the pickle.load() documentation:
Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2. If fix_imports is true, pickle will try to map the old Python 2 names to the new names used in Python 3. The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.
Setting the encoding to latin1 allows you to import the data directly:
with open(mshelffile, 'rb') as f:
d = pickle.load(f, encoding='latin1')
but you'll need to verify that none of your strings are decoded using the wrong codec; Latin-1 works for any input as it maps the byte values 0-255 to the first 256 Unicode codepoints directly.
The alternative would be to load the data with encoding='bytes', and decode all bytes keys and values afterwards.
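A minimal sketch of that alternative, assuming the dict-in-a-dict of str keys and values described in the question, and assuming the original text was really UTF-8 (substitute the actual encoding if you know it):
import pickle

with open(mshelffile, 'rb') as f:      # mshelffile as in the question's traceback
    d = pickle.load(f, encoding='bytes')

# Decode the outer and inner bytes keys/values back to str.
d = {k.decode('utf-8'): {ik.decode('utf-8'): iv.decode('utf-8')
                         for ik, iv in v.items()}
     for k, v in d.items()}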
Note that in Python versions before 3.6.8, 3.7.2 and 3.8.0, unpickling of Python 2 datetime object data is broken unless you use encoding='bytes'.
Using encoding='latin1' causes some issues when your object contains NumPy arrays; using encoding='bytes' works better.
Please see this answer for a complete explanation of using encoding='bytes'.
I have been trying to write a simple script that can save user input (originating from an iPhone) to a text file. The issue I'm having is that when a user uses an Emoji icon, it breaks the whole thing.
OS: Ubuntu
Python Version: 2.7.3
My code currently looks like this:
import codecs

f = codecs.open(path, "w+", encoding="utf8")
f.write("Desc: " + json_obj["description"])
f.close()
When an Emoji character is passed in the description variable, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
Any possible help is appreciated.
The most likely problem here is that json_obj["description"] is actually a UTF-8-encoded str, not a unicode. So, when you try to write it to a codecs-wrapped file, Python has to decode it from str to unicode so it can re-encode it. And that's the part that fails, because that automatic decoding uses sys.getdefaultencoding(), which is 'ascii'.
For example:
>>> f = codecs.open('emoji.txt', 'w+', encoding='utf-8')
>>> e = u'\U0001f1ef'
>>> print e
🇯
>>> e
u'\U0001f1ef'
>>> f.write(e)
>>> e8 = e.encode('utf-8')
>>> e8
'\xf0\x9f\x87\xaf'
>>> f.write(e8)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)
There are two possible solutions here.
First, you can explicitly decode everything to unicode as early as possible. I'm not sure where your json_obj is coming from, but I suspect it's not actually the stdlib json.loads, because by default, that always gives you unicode keys and values. So, replacing whatever you're using for JSON with the stdlib functions will probably solve the problem.
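For instance, here is a minimal sketch of that first fix in Python 2; the JSON string and output path are made up for illustration:
import json
import codecs

path = 'desc.txt'  # hypothetical path
# The stdlib json module hands back unicode values, so no implicit decode happens later.
json_obj = json.loads('{"description": "\\ud83d\\ude00 party"}')
f = codecs.open(path, 'w+', encoding='utf-8')
f.write(u'Desc: ' + json_obj['description'])
f.close()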
Second, you can leave everything as UTF-8 str objects and stay in binary mode. If you know you have UTF-8 everywhere, just open the file instead of codecs.open, and write without any encoding.
Also, you should strongly consider using io.open instead of codecs.open. It has a number of advantages, including:
- It raises an exception instead of doing the wrong thing if you pass it incorrect values.
- It's often faster.
- It's forward-compatible with Python 3.
- It has a number of bug fixes that will never be back-ported to codecs.
The only disadvantage is that it's not backward-compatible with Python 2.5. Unless that matters to you, don't use codecs.
I read some strings from a file, and when I try to print these UTF-8 strings in the Windows console, I get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
I've tried to set the console encoding to UTF-8 with "chcp 65001", but then I get this error message:
LookupError: unknown encoding: cp65001
I recommend checking similar questions on Stack Overflow; there are many of them. Anyway, you can do it this way:
- Read from the file in any encoding (for example UTF-8), but decode the strings to unicode.
- For the Windows console, output unicode strings. You don't need to encode in this special case, and you don't need to set the console encoding; the output text will be correctly encoded automatically.
- For files, you need to use the codecs module or to encode to the proper encoding.
The print command tries to convert Unicode strings to the encoding supported by the console. Try:
>>> import sys
>>> sys.stdout.encoding
'cp852'
It shows you what encoding the console supports (or rather, what encoding Python has been told it supports). If a character cannot be converted to that encoding, there is no way to display it correctly.
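If you just need the script to survive, a hedged workaround (Python 2 here, matching the question) is to encode explicitly and substitute whatever the console cannot show:
import sys

s = u'\u2022 bullet'  # U+2022 may not exist in the console's codepage
# Fall back to ASCII when stdout has no encoding (e.g. when redirected).
print s.encode(sys.stdout.encoding or 'ascii', 'replace')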