I have the following function
import urllib.request

def seek():
    web = urllib.request.urlopen("http://wecloudforyou.com/")
    text = web.read().decode("utf8")
    return text

texto = seek()
print(texto)
When I decode to utf-8, I get the html code with indentation and carriage returns and all, just like it's seen on the actual website.
<!DOCTYPE html>
<html>
<head>
<title>We Cloud for You |
If I remove .decode('utf8'), I get the code, but the indentation is gone and the line breaks are replaced by \n.
<!DOCTYPE html>\n<html>\n <head>\n <title>We Cloud for You
So, why is this happening? As far as I know, when you decode, you are basically converting some encoded string into Unicode.
My sys.stdout.encoding is CP1252 (Windows 1252 encoding)
According to this thread: Why does Python print unicode characters when the default encoding is ASCII?
- Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- The terminal displays output according to its own encoding settings.
- The terminal's encoding is independent from the shell's.
So, it seems like Python needs to read the text as Unicode before it can convert it to CP1252 and print it on the terminal. But I don't understand why, if the text is not decoded, the line breaks are replaced with \n.
sys.getdefaultencoding() returns utf8.
In Python 3, when you print a bytes value (the raw bytes from the network, without decoding), you see the representation of that value as a Python bytes literal. This includes representing newlines as \n escape sequences.
By decoding, you now have a unicode string value instead, and print() can handle that directly:
>>> print(b'Newline\nAnother line')
b'Newline\nAnother line'
>>> print(b'Newline\nAnother line'.decode('utf8'))
Newline
Another line
This is perfectly normal behaviour.
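Applied to the function from the question, a minimal sketch of both behaviours (same URL and codec as above):

import urllib.request

def seek():
    web = urllib.request.urlopen("http://wecloudforyou.com/")
    raw = web.read()           # bytes: printing this shows the b'...' repr
    return raw.decode("utf8")  # str: print() renders real line breaks

print(seek())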
Related
Say I have a source file encoded in UTF-8. When the Python interpreter loads that source file, will it convert the file content to unicode in memory and then try to evaluate the source code in unicode?
If I have a string with a non-ASCII char in it, like
astring = '中文'
and the file is encoded in GBK.
Running that file with Python 2, I found that the string is actually still raw GBK bytes.
So I suspect that the Python 2 interpreter does not convert source code to unicode, because if it did, the string content would be in unicode (I heard it is actually UTF-16).
Is that right? And if so, how about the Python 3 interpreter? Does it convert source code to unicode format?
Actually, I know how to define unicode and raw strings in both Python 2 and 3.
I'm just curious about one detail when the interpreter loads source code.
Will it convert the WHOLE raw source code (encoded bytes) to unicode at the very beginning and then try to interpret the unicode-format source code piece by piece?
Or does it instead load the raw source piece by piece, and only decode what it thinks it should? For example, when it hits the literal u'中文', it decodes to unicode; when it hits the literal b'中文', there is no need to decode.
Which way the interpreter will go?
If your source file is encoded with GBK, put this line at the top of the file (first or second line):
# coding: gbk
This is required for both Python 2 and 3.
If you omit this encoding declaration, the interpreter will assume ASCII in the case of Python 2, and UTF-8 for Python 3.
The encoding declaration controls how the interpreter reads the bytes of the source file. This is mainly relevant for string literals (like in your example), but theoretically also applies to comments and even identifiers (it's probably not a good idea to use non-ASCII in identifiers, though).
As for the question whether you get byte strings or unicode strings: this depends on the syntax, not on the choice and declaration of encoding.
As pointed out in Ignacio's answer, if you want to have unicode strings in Python 2, you need to use the u'...' notation.
In Python 3, the u prefix is optional.
So, with a correct encoding declaration in the file header, it is sufficient to write astring = '中文' to get a correct unicode string in Python 3.
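For example, a minimal Python 3 sketch, assuming the source file is actually saved in GBK:

# coding: gbk
astring = '中文'               # decoded from the file's GBK bytes into a str
print(len(astring))            # 2 code points, although the file stores 4 bytes
print(astring.encode('utf8'))  # b'\xe4\xb8\xad\xe6\x96\x87'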
Update
In a comment, the OP asks about the interpretation of b'中文'.
In Python 3, this isn't allowed (byte strings can only contain ASCII characters), but you can test this yourself in Python 2.x:
# coding: gbk
x = b'中文'
y = u'中文'
print repr(x)
print repr(y)
This will produce:
'\xd6\xd0\xce\xc4'
u'\u4e2d\u6587'
The first line reflects the actual bytes contained in the source file (if you saved it with GBK, of course).
So there seems to be no decoding happening for b'中文'.
However, I don't know how the interpreter internally represents the source code with respect to encoding (that seems to be your question).
This is implementation-dependent anyway, so the answer might even be different for cPython, Jython, IronPython etc.
So I suspect that the Python 2 interpreter does not convert source code to unicode.
It never does. If you want to use Unicode rather than bytes then you need to use a unicode literal instead:
astring = u'中文'
Python syntax itself is plain ASCII, meaning that the actual encoding does not matter except for literal strings, be they unicode strings or byte strings. Identifiers can use non-ASCII characters in Python 3 (IMHO it would be a very bad practice), but their meaning is normally internal to the Python interpreter, so the way it reads them is not really important.
Byte strings are always left unchanged. That means that normal strings in Python 2 and byte literal strings in Python 3 are never converted.
Unicode strings are always converted:
if the special comment coding: charset_name exists on the first or second line, the original byte string is converted as it would be with decode(charset_name)
if no encoding is specified, Python 2 will assume ASCII and Python 3 will assume UTF-8
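A matching Python 2 sketch, assuming the source file is saved as UTF-8 this time:

# coding: utf-8
b = '中文'     # byte string: the raw UTF-8 bytes from the file, unconverted
u = u'中文'    # unicode string: decoded using the declared coding
print repr(b)  # '\xe4\xb8\xad\xe6\x96\x87'
print repr(u)  # u'\u4e2d\u6587'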
When using POST in Django, an ascii (byte) string is automatically converted into a unicode string.
for example:
s = '\xe2\x80\x99'
is a str-type string (which is UTF-8 encoded).
When I POST this string to Django and then get it from request.POST, it has been converted to a unicode string:
u'\xe2\x80\x99'
This may cause decode/encode errors, because Python thinks it is a unicode string, while it is in fact UTF-8 bytes.
My question is: how can I FORCE-convert the unicode string back to a byte string? That means just removing the leading u from u'\xe2\x80\x99' to get '\xe2\x80\x99'. The traditional decode and encode methods may not work in this situation.
When receiving the request, the encoding of the response is mis-declared as (probably) iso-8859-1, or perhaps not declared at all, defaulting to that encoding. The web site should declare its encoding correctly with a header:
<head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
</head>
But if that isn't under your control, you can undo the encoding and decode it correctly:
>>> s = u'\xe2\x80\x99'
>>> s.encode('iso-8859-1')
'\xe2\x80\x99'
>>> s.encode('iso-8859-1').decode('utf8')
u'\u2019'
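Packaged as a small sketch (fix_mojibake is a hypothetical helper name, not a Django API):

def fix_mojibake(s, wrong='iso-8859-1', right='utf8'):
    # Re-encode with the codec that was wrongly used to decode,
    # then decode with the codec the bytes really were in.
    return s.encode(wrong).decode(right)

print repr(fix_mojibake(u'\xe2\x80\x99'))  # u'\u2019'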
I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try to x.write(string) I get the UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do print(u'something'). But I need to use a variable, and the quotation marks in u'_' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially, all this is supposed to be doing is building a large number of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest opening the input and/or output file with encoding "utf-8-sig", which automatically handles the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that would only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM, as Python's default UTF-8 codec interprets BOMs as regular characters, so the input would not raise an issue but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
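Putting those suggestions together, a sketch (the file names are placeholders, not the question's actual paths):

import codecs

# utf-8-sig consumes a leading BOM when reading; write plain utf-8
source = codecs.open('actionbreak/input.csv', 'r', 'utf-8-sig')
outTarget = codecs.open('actionbreak/output.csv', 'w+', 'utf-8')
for line in source:
    outTarget.write(line)  # codecs.open expects unicode; line already is
source.close()
outTarget.close()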
In Python 2.x there are two types of string: byte strings and unicode strings. The former contains bytes and the latter unicode code points. It is easy to determine what type of string it is: a unicode string starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' chars are the same because they are in the ASCII range. \u0430 is a unicode code point; it is outside the ASCII range. A "code point" is Python's internal representation of unicode characters, and code points can't be saved to a file. They need to be encoded to bytes first. Here is how an encoded unicode string looks (as it is encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string can now be written to a file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
... f.write(s.encode('utf8'))
Now, it is important to remember what encoding we used when writing to the file, because to be able to read the data we need to decode the content. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
... content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got the encoded bytes, exactly the same as in s.encode('utf8'). To decode them, we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got back our unicode string with unicode code points.
>>> print content.decode('utf8')
abc абв
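In Python 2.6+, io.open can do the encoding and decoding for you; a sketch of the same round trip:

import io

with io.open('text.txt', 'w', encoding='utf8') as f:
    f.write(u'abc абв')    # encoded to UTF-8 bytes on write

with io.open('text.txt', 'r', encoding='utf8') as f:
    print f.read()         # decoded back to unicode: abc абв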
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM or byte order mark, and it's a holdover from the early days of unicode, when people couldn't agree which way they wanted their bytes to go. Unicode documents may be prefaced with a \ufeff; if the bytes are read in the wrong order, it shows up as \ufffe, which is how a reader can detect the byte order.
If you hit an error on that character in the first position, you can be sure the issue is that you are not decoding the data with a BOM-aware codec such as utf-8-sig, and the file itself is probably fine.
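A quick sketch of the difference a BOM-aware codec makes:

>>> data = u'\ufeffHello'.encode('utf8')  # bytes starting with a BOM
>>> data.decode('utf8')                   # plain utf-8 keeps it as a character
u'\ufeffHello'
>>> data.decode('utf-8-sig')              # utf-8-sig strips it
u'Hello'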
How do I read Japanese and Chinese characters correctly?
I'm using Python 2.5. The output is displayed as "E:\Test\?????????"
path = r"E:\Test\は最高のプログラマ"
t = path.encode()
print t
u = path.decode()
print u
t = path.encode("utf-8")
print t
t = path.decode("utf-8")
print t
Please do read the Python Unicode HOWTO; it explains how to process and include non-ASCII text in your Python code.
If you want to include Japanese text literals in your code, you have several options:
Use unicode literals (create unicode objects instead of byte strings), where any non-ascii codepoint is represented by a unicode escape sequence. They take the form \uabcd: a backslash, a u and 4 hexadecimal digits:
ru = u'\u30EB'
would be one character, the katakana 'ru' codepoint ('ル').
Use unicode literals, but include the characters in some form of encoding. Your text editor will save files in a given encoding (say, UTF-8); you need to declare that encoding at the top of the source file (note that per PEP 263 the source encoding must be ASCII-compatible, so UTF-16 cannot be used here):
# encoding: utf-8
ru = u'ル'
where 'ル' is included without using an escape. The default encoding for Python 2 files is ASCII, so by declaring an encoding you make it possible to use Japanese directly.
Use byte string literals, ready encoded. Encode the codepoints by some other means and include them in your byte string literals. If all you are going to do with them is use them in encoded form anyway, this should be fine:
ru = '\xeb\x30' # ru encoded to UTF16 little-endian
I encoded 'ル' to UTF-16 little-endian because that's the default Windows NTFS filename encoding.
The next problem will be your terminal. The Windows console is notorious for not supporting many character sets out of the box. You probably want to configure it to handle UTF-8 instead. See this question for some details, but in short you need to run the following command in the console:
chcp 65001
to switch to UTF-8, and you may need to switch to a console font that can handle your codepoints (Lucida perhaps?).
There are two independent issues:
You should specify the Python source encoding if you use non-ascii characters, and use Unicode literals for data that represents text, e.g.:
# -*- coding: utf-8 -*-
path = ur"E:\Test\は最高のプログラマ"
Printing Unicode to Windows console is complicated but if you set correct font then just:
print path
might work.
Regardless of whether your console can display the path, it should be fine to pass the Unicode path to filesystem functions, e.g.:
entries = os.listdir(path)
Don't call .encode(char_enc) on bytestrings, call it on Unicode strings instead.
Don't call .decode(char_enc) on Unicode strings, call it on bytestrings instead.
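A combined sketch; the path is the question's example, and repr is used so the console encoding cannot interfere while inspecting the results:

# -*- coding: utf-8 -*-
import os

path = ur"E:\Test\は最高のプログラマ"
for name in os.listdir(path):  # unicode path in, unicode names out
    print repr(name)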
You should force the string to be a unicode object like
path = ur"E:\Test\は最高のプログラマ"
Docs on string literals relevant to 2.5 are located here
Edit: I'm not positive whether the object is unicode in 2.5, but the docs do state that \uXXXX[XXXX] will be processed and the string will be "a Unicode string".
I wish to seek some clarifications on Unicode and str methods in Python. After reading some explanation on Unicode, there are still couple of doubts I hope folks can help me on:
Am I right to say that when declaring a unicode string, e.g. word = u'foo', Python uses the encoding of the terminal and decodes foo from e.g. UTF-8, assigning word the hex representation in unicode?
So, in general, is the process of printing out characters from a file always one of decoding the byte stream, according to its encoding, into the unicode representation before displaying the mapped characters?
In my terminal, why does 'é'.lower() or str('é') display the hex '\xc3\xa9', whereas 'a'.lower() does not?
First we should be clear we are talking about Python 2 only. Python 3 is different.
You're right. But if you write u"abcd" in a .py file, the declared encoding of the source file determines how the interpreter decodes your string.
You need to decode it first, and then encode it and print. In Python 2, DON'T print unicode directly! Otherwise, if the system encodes it in an incompatible way (like "ascii"), an exception will be raised.
You have to do all these explicitly.
The short answer is that "a" doesn't have to be represented as "\x61"; "a" is simply more readable. A longer answer: typically in the interactive shell, if you type a value and press enter, Python shows the repr() of your string. repr tries to print everything in an ASCII representation. For "a", it's already ASCII, so it is output directly. The str 'é' is a UTF-8 encoded byte stream, so Python escapes each byte and prints it as '\xc3\xa9'.
I don't think Python does any automatic encoding or decoding on console I/O. Consider the following:
>>> 'é'
'\xc3\xa9'
>>> 'é'.decode('UTF-8')
u'\xe9'
You'll notice that \xe9 is the Unicode code point for 'LATIN SMALL LETTER E WITH ACUTE', while \xc3\xa9 is the byte sequence corresponding to the same character in UTF-8.
Everything changes in Python 3, since all strings are Unicode. I'm not sure of the rules there.
See http://www.python.org/dev/peps/pep-0263/ about how to specify encoding of Python source file. For Python interpreter there's PYTHONIOENCODING environment variable.
What OS do you use?
The statement word = u'foo' assigns a unicode string object, not a "hex representation". Unicode objects represent sequences of text characters. Also, it is wrong to think of decoding in this context. Unicode is not an encoding, nor does it "have" an encoding.
Yes. Decode In: Encode Out.
For the repr of a non-unicode string literal, Python will use sys.stdin.encoding; for the repr of a unicode string literal, Python will use "unicode_escape".
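A closing Python 2 sketch of "Decode In: Encode Out" (file names are placeholders):

raw = open('in.txt', 'rb').read()  # bytes from disk
text = raw.decode('utf8')          # decode at the input boundary
text = text.lower()                # work on unicode internally
with open('out.txt', 'wb') as f:
    f.write(text.encode('utf8'))   # encode at the output boundary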