Possible Duplicate:
Python, Unicode, and the Windows console
I read some strings from a file, and when I try to print these UTF-8 strings in the Windows console, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
I've tried to set the console encoding to UTF-8 with "chcp 65001",
but then I get this error message:
LookupError: unknown encoding: cp65001
I recommend checking similar questions on Stack Overflow; there are many of them. Anyway, you can do it this way:
Read from the file in any encoding (for example UTF-8), but decode the strings to unicode.
For the Windows console, output unicode strings. You don't need to encode in this special case, and you don't need to set the console encoding; the output text will be encoded correctly and automatically.
For files, you do need to use the codecs module or encode to the proper encoding yourself.
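For example (a minimal sketch; the file name and contents are made up, and io.open behaves the same way in Python 2 and 3):

```python
import io

# Write a UTF-8 file, then read it back; text mode decodes to unicode on read.
with io.open('demo.txt', 'w', encoding='utf-8') as f:
    f.write(u'\u010cesk\u00fd text')          # "Český text"

with io.open('demo.txt', encoding='utf-8') as f:
    text = f.read()                           # a unicode string, not bytes

print(text)   # print the unicode string directly; no manual .encode()
```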
The print command tries to convert Unicode strings to the encoding supported by the console. Try:
>>> import sys
>>> sys.stdout.encoding
'cp852'
This shows you what encoding the console supports (or rather, what Python has been told it supports). If a character cannot be converted to that encoding, there is no way to display it correctly.
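You can test that conversion yourself before printing (a minimal sketch; 'č' is just an example character, and the fallback to 'ascii' covers the case where stdout reports no encoding):

```python
import sys

# Check whether the console's reported encoding can represent a character.
ch = u'\u010d'                         # 'č' (UTF-8 bytes 0xc4 0x8d)
enc = sys.stdout.encoding or 'ascii'
try:
    ch.encode(enc)
    can_display = True
except UnicodeEncodeError:
    can_display = False

print(ch if can_display else repr(ch))
```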
Possible Duplicate:
"for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte
I had a script I was trying to port from python2 to python3.
I did read through the porting documentation, https://docs.python.org/3/howto/pyporting.html.
My original Python 2 script used open('filename.txt'). In porting to Python 3 I updated it to io.open('filename.txt'). Now when running the script as Python 2 or Python 3 with the same input files I get errors like UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 793: invalid start byte.
Does Python 2's open have less strict error checking than io.open, or does it use a different default encoding? Does Python 3 have an equivalent way to call io.open to match Python 2's built-in open?
Currently I've started using f = io.open('filename.txt', mode='r', errors='replace') which works. And comparing output to the original python2 version, no important data was lost from the replace errors.
First, io.open is open; there's no need to stop using open directly.
The issue is that your call to open assumes the file is UTF-8-encoded when it is not. You'll have to supply the correct encoding explicitly:
open('filename.txt', encoding='iso-8859-1')  # for example
(Note that the default encoding is platform-specific, but your error indicates that yours does, in fact, default to UTF-8.)
In Python 2, no attempt was made to decode non-ASCII files; reading from a file returned a str value consisting of whatever bytes were actually stored in the file.
This is part of the overall shift in Python 3 from using the old str type as sometimes text, sometimes bytes, to using str exclusively for Unicode text and bytes for any particular encoding of the text.
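A short sketch of that difference (assuming, purely for illustration, that the stray 0x93 byte came from a cp1252-encoded file, where it is a curly quote):

```python
import io

# Bytes that are valid cp1252 but not valid UTF-8 (0x93 = '“' in cp1252).
with io.open('legacy.txt', 'wb') as f:
    f.write(b'caf\x93')

# Python 3's text mode must be told the file's real encoding...
with io.open('legacy.txt', encoding='cp1252') as f:
    text = f.read()

# ...or told what to do with undecodable bytes.
with io.open('legacy.txt', encoding='utf-8', errors='replace') as f:
    replaced = f.read()

print(text)       # caf“
print(replaced)   # caf� (U+FFFD replacement character)
```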
I am using Python 2.7, and to overcome UTF-8 issues I am using the pandas to_csv method. The issue is that I am still getting unicode errors, which I don't get when I run the script on my local laptop with Python 3 (not an option for batch processing).
df = pd.DataFrame(stats_results)
df.to_csv('/home/mp9293q/python_scripts/stats_dim_registration_set_column_transpose.csv',
          quoting=csv.QUOTE_ALL, doublequote=True, index=False,
          index_label=False, header=False, line_terminator='\n',
          encoding='utf-8')
Gives error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc7' in position 4: ordinal not in range(128)
I believe you might be having one of these two problems (or maybe both):
1. As you mentioned in the comments, the file in which you are trying to save the unicode data already exists. In that case there is a good chance that the destination file does not have UTF-8/16/32 as its encoding scheme. By this I mean that when the file was originally created, its encoding scheme may not have been UTF-8; it could possibly be ANSI. So check whether the destination file's encoding scheme is of the UTF family or not.
2. Encode the unicode string to UTF-8 before storing it in the file. By this I mean that any content you are trying to save to your destination file, if it contains unicode text, should first be encoded.
Ex.
# A character which cannot be encoded in 7-bit ASCII
Uni_str = u"Ç"
# Encoding the unicode text into UTF-8 bytes
Uni_str = Uni_str.encode("utf-8")
The above code works differently in Python 2.x and 3.x, the reason being that 2.x uses ASCII as its default encoding scheme while 3.x uses UTF-8. Another difference between the two is how they treat a string after passing it through encode().
In Python 2.x:
type(u"Ç".encode("utf-8"))
outputs
<type 'str'>
In Python 3.x:
type(u"Ç".encode("utf-8"))
outputs
<class 'bytes'>
As you can see, in Python 2.x the return type of encode() is str, but in 3.x it is bytes.
So for your case, I would recommend encoding each string value containing unicode data in your dataframe using encode() before trying to store it in the file.
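A minimal sketch of that idea, with plain lists standing in for dataframe cells (in pandas the same per-cell transformation could be applied with df.applymap). It is written in Python 3 syntax, where str is the unicode type; under Python 2 the isinstance check would test for unicode instead:

```python
# Encode every text cell to UTF-8 bytes before handing the rows to a
# byte-oriented CSV writer (the Python 2 situation in the question).
rows = [[u'\xc7a va', u'ok'], [u'plain', u'ascii']]   # u'\xc7' is the 'Ç' from the error

encoded_rows = [
    [cell.encode('utf-8') if isinstance(cell, str) else cell for cell in row]
    for row in rows
]
```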
I am trying to write about 1000 rows to a .xlsx file from my python application. The data is basically a combination of integers and strings. I am getting intermittent error while running wbook.close() command. The error is the following:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15:
ordinal not in range(128)
My data does not have anything in unicode, so I am wondering why the decoder is being invoked at all. Has anyone noticed this problem?
0xc3 is the lead byte of a two-byte UTF-8 sequence (in Latin-1 it would be "Ã"), so your data is actually UTF-8-encoded bytes. What you need to do is change the encoding: use the decode() method.
string.decode('utf-8')
Also depending on your needs and uses you could add
# -*- coding: utf-8 -*-
at the beginning of your script, but only if you are sure that the encoding will not interfere and break something else.
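A minimal sketch of that decode step (the byte string here is made up; in Python 2 such UTF-8 bytes would arrive as a plain str):

```python
# 0xc3 introduces a two-byte UTF-8 sequence; b'\xc3\xa3' decodes to 'ã'.
raw = b'S\xc3\xa3o Paulo'      # UTF-8 bytes, like a Python 2 str
text = raw.decode('utf-8')     # now unicode, safe to hand to the worksheet
```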
As Alex Hristov points out you have some non-ascii data in your code that needs to be encoded as UTF-8 for Excel.
See the following examples from the docs which each have instructions on handling UTF-8 with XlsxWriter in different scenarios:
Example: Simple Unicode with Python 2
Example: Simple Unicode with Python 3
Example: Unicode - Polish in UTF-8
I have a simple program that loads a .json file which contains a funny character. The program (see below) runs fine in Terminal but gets this error in IntelliJ:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
2: ordinal not in range(128)
The crucial code is:
with open(jsonFileName) as f:
    jsonData = json.load(f)
If I replace the open with:
with open(jsonFileName, encoding='utf-8') as f:
Then it works in both IntelliJ and Terminal. I'm still new to Python and the IntelliJ plugin, and I don't understand why they're different. I thought sys.path might be different, but the output makes me think that's not the cause. Could someone please explain? Thanks!
Versions:
OS: Mac OS X 10.7.4 (also tested on 10.6.8)
Python 3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50) /Library/Frameworks/Python.framework/Versions/3.2/bin/python3.2
IntelliJ: 11.1.3 Ultimate
Files (2):
1. unicode-error-demo.py
#!/usr/bin/python
import json
from pprint import pprint as pp
import sys

def main():
    if len(sys.argv) != 2:
        print(sys.argv[0], "takes one arg: a .json file")
        return
    jsonFileName = sys.argv[1]
    print("sys.path:")
    pp(sys.path)
    print("processing", jsonFileName)
    # with open(jsonFileName) as f:  # OK in Terminal, but BUG in IntelliJ: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
    with open(jsonFileName, encoding='utf-8') as f:  # OK in both
        jsonData = json.load(f)
    pp(jsonData)

if __name__ == "__main__":
    main()
2. encode-temp.json
["™"]
The JSON .load() function expects Unicode data, not raw bytes. Python automatically tries to decode the byte string to a Unicode string for you using a default codec (in your case ASCII), and fails. By opening the file with the UTF-8 codec, Python makes an explicit conversion for you. See the open() function, which states:
In text mode, if encoding is not specified the encoding used is platform dependent.
The encoding that would be used is determined as follows:
Try os.device_encoding() to see if there is a terminal encoding.
Otherwise, use the locale.getpreferredencoding() function, which depends on the environment your code runs in; its do_setlocale argument is set to False.
Use 'ascii' as a default if both methods return None.
This is all done in C, but its Python equivalent would be:
if encoding is None:
    encoding = os.device_encoding()
if encoding is None:
    encoding = locale.getpreferredencoding(False)
if encoding is None:
    encoding = 'ascii'
So when you run your program in a terminal, os.device_encoding() returns 'UTF-8', but when running under IntelliJ there is no terminal, and if no locale is set either, Python uses 'ascii'.
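That fallback chain can be sketched as runnable code (a rough approximation of what the interpreter does in C; the exact checks differ, and the broad except clause simply covers streams with no usable file descriptor):

```python
import locale
import os
import sys

# Approximate the interpreter's default-encoding choice for a stream.
encoding = None
try:
    encoding = os.device_encoding(sys.stdin.fileno())  # terminal encoding, if any
except (ValueError, OSError, AttributeError):
    encoding = None                                    # no real tty behind stdin
if encoding is None:
    encoding = locale.getpreferredencoding(False)      # environment / locale
if encoding is None:
    encoding = 'ascii'                                 # last resort

print(encoding)
```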
The Python Unicode HOWTO tells you all about the difference between unicode strings and bytestrings, as well as encodings. Another essential article on the subject is Joel Spolsky's Absolute Minimum Unicode knowledge article.
Python 2.x has byte strings and unicode strings. The basic strings are assumed to be ASCII-encoded. ASCII uses only 7 bits/char, which allows encoding 128 characters, while modern UTF-8 uses up to 4 bytes/char. UTF-8 is compatible with ASCII (so any ASCII-encoded string is a valid UTF-8 string), but not the other way round.
Apparently, your file contains non-ASCII characters. Python by default wants to read it in as a simple ASCII-encoded string, spots a non-ASCII byte (its first bit is not 0, as it's 0xe2) and says: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128).
It has nothing to do with Python, but it's still my favourite tutorial about encodings:
http://hektor.umcs.lublin.pl/~mikosmul/computing/articles/linux-unicode.html
I am getting
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 104: ordinal not in range(128)
I am using IntegerProperty, StringProperty, and DateTimeProperty.
That's because 0xb0 (decimal 176) is not a valid character code in ASCII (which defines only values between 0 and 127).
Check where you got that string from and use the proper encoding.
If you need further help, post the code.
You are trying to put Unicode data (probably text with accents) into an ASCII string.
You can use Python's codecs module to open a text file with UTF-8 encoding and write the Unicode data to it.
The .encode method may also help (for example, u"õ".encode('utf-8')).
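A minimal sketch of the codecs approach (the file name is made up for illustration):

```python
import codecs

# Write accented text through an explicit UTF-8 codec, then read it back.
with codecs.open('accents.txt', 'w', encoding='utf-8') as f:
    f.write(u'S\xe3o Paulo \xf5')     # unicode in, UTF-8 bytes on disk

with codecs.open('accents.txt', 'r', encoding='utf-8') as f:
    roundtrip = f.read()              # decoded back to unicode
```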
Python 2 defaults to ASCII encoding; if you are dealing with characters outside the ASCII range, you need to specify that in your code.
One way to do this is to define the encoding at the top of your code.
This snippet sets the source encoding at the top of the file to Latin-1 (which includes 0xb0):
#!/usr/bin/python
# -*- coding: latin-1 -*-
import os, sys
...
See PEP 263 for more info on source encoding declarations.
When I write my foreign-language "flashcard" programs, I always use Python 3.x, since its native string type is Unicode and its default source encoding is UTF-8. Your encoding problems will generally be far less frequent.
If you're working on a program that many people will share, however, you may want to consider using encode and decode with Python 2.x, but only when storing and retrieving data elements in persistent storage: encode your non-ASCII characters when saving, keep the encoded byte representations only in storage, and decode when fetching the strings back, but for end-user display only. This will eliminate the need to constantly encode and decode your strings throughout your program.
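A minimal sketch of that encode-at-the-edges pattern (the names and the stored value are made up; in Python 2, decode would return unicode rather than str):

```python
# Keep text as unicode in memory; store and fetch it as UTF-8 bytes.
record = u'caf\xe9'                    # unicode while the program works with it

stored = record.encode('utf-8')        # bytes at the storage boundary
assert isinstance(stored, bytes)

fetched = stored.decode('utf-8')       # back to unicode, for display only
```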
@jcoon also has a pretty standard response to this problem.