I'd like to seek some clarification on Unicode and str methods in Python. After reading some explanations of Unicode, I still have a couple of doubts I hope folks can help me with:
Am I right to say that when declaring a unicode string, e.g. word = u'foo', Python uses the encoding of the terminal (e.g. UTF-8) to decode foo, and assigns word the resulting Unicode code points?
So, in general, is printing out characters from a file always a matter of decoding the byte stream, according to its encoding, into a Unicode representation before displaying the mapped characters?
In my terminal, why does 'é'.lower() or str('é') display as the hex escapes '\xc3\xa9', whereas 'a'.lower() does not?
First we should be clear we are talking about Python 2 only. Python 3 is different.
You're right. But if you write u"abcd" in a .py file, the declared encoding of the source file determines how the interpreter decodes your string.
You need to decode it first, and then encode it before printing. In Python 2, DON'T print unicode objects directly! Otherwise, if the system encodes them in an incompatible way (like "ascii"), an exception will be raised.
You have to do all these explicitly.
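For example, a minimal sketch of that explicit decode-in/encode-out flow, assuming Python 2 and a UTF-8 terminal:
# Python 2 sketch: decode bytes on the way in, encode on the way out
raw = '\xc3\xa9'            # UTF-8 bytes, e.g. read from a file
text = raw.decode('utf-8')  # now a unicode object: u'\xe9'
print text.encode('utf-8')  # encode back to bytes before printing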
The short answer is that "a" doesn't have to be represented as "\x61"; "a" is simply more readable. A longer answer: typically in the interactive shell, if you type a value and press Enter, Python shows the repr() of your string. repr() tries to render everything in ASCII, escaping whatever isn't. "a" is already ASCII, so it is output directly. The str "é" is a UTF-8 encoded byte stream, so Python escapes each byte and prints it as '\xc3\xa9'.
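A quick sketch of that repr() behaviour at a Python 2 prompt:
print repr('a')          # 'a' -- printable ASCII, shown directly
print repr('\xc3\xa9')   # '\xc3\xa9' -- the two UTF-8 bytes of 'é', escaped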
I don't think Python does any automatic encoding or decoding on console I/O. Consider the following:
>>> 'é'
'\xc3\xa9'
>>> 'é'.decode('UTF-8')
u'\xe9'
You'll notice that \xe9 is the Unicode code point for 'LATIN SMALL LETTER E WITH ACUTE', while \xc3\xa9 is the byte sequence corresponding to the same character in UTF-8.
Everything changes in Python 3, since all strings are Unicode. I'm not sure of the rules there.
See http://www.python.org/dev/peps/pep-0263/ for how to specify the encoding of a Python source file. For the Python interpreter itself, there's the PYTHONIOENCODING environment variable.
What OS do you use?
The statement word = u'foo' assigns a unicode string object, not a "hex representation". Unicode objects represent sequences of text characters. Also, it is wrong to think of decoding in this context. Unicode is not an encoding, nor does it "have" an encoding.
Yes. Decode In: Encode Out.
When you enter a non-unicode string literal at the prompt, its contents are whatever bytes your terminal sent, so they depend on sys.stdin.encoding; for the repr of a unicode string, Python escapes the code points, as the "unicode_escape" codec does.
Related
I am using a package in Python that returns a string using ASCII characters as opposed to Unicode (e.g. it returns 'serÃ©' as opposed to 'seré').
Given this is Python 3.8, the string is actually Unicode; the package just seems to output it as if it were ASCII. As such, when I try to perform x.decode('utf-8') or x.encode('ascii'), neither works. Is there a way to make Python treat the string as if it were ASCII, such that I can decode it to Unicode? Or is there a package that can serve this purpose?
I am relatively new to Python, so I apologise if my explanation is unclear. I am happy to clarify things if needed.
Code
from spanishconjugator import Conjugator as c
verb = c().conjugate('pasar', 'preterite', 'indicative', 'yo')
print(verb)
This returns the string 'pasÃ©' where it should return 'pasé'.
Update
From further searching and from your answers, it appears to be an issue to do with single 2-byte UTF-8 (é) characters being literally interpreted as two 1-byte Latin-1 (Ã©) characters (nothing to do with ASCII, my mistake).
Managed to fix it with:
verb.encode('latin-1').decode('utf-8')
Thank you to those that commented.
If the input string contains the raw byte ordinals (such as \xc3\xa9/Ã© instead of é), use latin1 to encode it to bytes verbatim, then decode with the desired encoding.
>>> "pasÃ©".encode('latin1').decode()
'pasé'
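A minimal sketch of the same round trip as a reusable helper (the name fix_mojibake is just illustrative):
# Python 3: UTF-8 bytes were mistakenly decoded as Latin-1, so re-encode
# with Latin-1 to recover the original bytes, then decode them as UTF-8
def fix_mojibake(s):
    return s.encode('latin-1').decode('utf-8')

print(fix_mojibake('pasÃ©'))  # pasé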
What I already know:
b'\xce\xb8'.decode('UTF-8') gives 'θ', because the decode() function is designed for exactly this job: decoding the bytes.
What I want to know is: does the Python 3 shell have some default config that controls the following behaviour?
>>> sys.getdefaultencoding()
'utf-8'
>>> b'\xce\xb8'.decode()
'θ'
>>> b'\xce\xb8'
b'\xce\xb8'
>>> b'\x41'
b'A'
>>> print(b'\xce\xb6')
b'\xce\xb6'
>>> print(b'\xce\xb6'.decode('utf8'))
ζ
It seems like shell mode uses ASCII as the default encoding rather than UTF-8.
The question is: is this true? If yes, where is the config that controls it located?
This has nothing to do with the encoding. Python is just showing you in the shell what the value is that you just gave it, in a more literal sense. Try this instead:
a = b'\xce\xb8'
print(a)
result (under Python 2 on a UTF-8 terminal; under Python 3 this prints b'\xce\xb8', as your own transcript shows):
θ
So a does hold the UTF-8 bytes, just as you expected. You're just misinterpreting what Python is echoing back to the console.
BTW, watch out for the 'b' prefix: in Python 2.X it is simply ignored, and you can tell which version you're on by whether the prefix shows up in the echoed result. See here:
Python 2.x:
>>> b'\xce\xb8'
'\xce\xb8'
Python 3.X
>>> b'\xce\xb8'
b'\xce\xb8'
So in Python 2.X, you'll get the same result with and without the 'b'. In Python 3.X, you get different behaviour either way than what you get in Python 2.X. I haven't done much with Python 3.X, but I believe this is because the representation of strings changed in 3.X: bytes and text became distinct types.
PS: If you really just care how Python is echoing strings back to you, I don't know that there's a way to change that. I wonder, however, why that matters to you.
Python 3 displays a byte as the equivalent ASCII character if its value is within the printable ASCII range; otherwise it displays the escaped hex value.
From the docs for the bytes type:
Only ASCII characters are permitted in bytes literals (regardless of the declared source code encoding). Any binary values over 127 must be entered into bytes literals using the appropriate escape sequence.
This is a deliberate design decision (from the same doc)
to emphasise that while many binary formats include ASCII based elements and can be usefully manipulated with some text-oriented algorithms, this is not generally the case for arbitrary binary data
The interpreter doesn't display characters for bytes outside the ASCII range because it cannot know whether the bytes are encoded as UTF-8, some other encoding, or even if they represent text data at all.
As user Steve points out in their answer, this behaviour is not related to encoding. It is not configurable; if you want to see the characters corresponding to a UTF-8 encoded bytestring, decode to str.
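A quick sketch of that ambiguity: the very same bytes are valid text under more than one encoding, so the repr cannot pick one for you.
# Python 3: identical bytes, different decodings
data = b'\xce\xb8'
print(data.decode('utf-8'))    # θ
print(data.decode('latin-1'))  # Î¸ -- equally valid, different text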
Say I have a source file encoded in UTF-8. When the Python interpreter loads that source file, will it convert the file content to Unicode in memory and then evaluate the source code in that Unicode form?
If I have a string with non-ASCII characters in it, like
astring = '中文'
and the file is encoded in gbk.
Running that file with Python 2, I found that the string actually still holds the raw GBK bytes.
So I suspect the Python 2 interpreter does not convert source code to Unicode, because if it did, the string content would be Unicode (I heard it is actually UTF-16).
Is that right? And if so, what about the Python 3 interpreter? Does it convert source code to a Unicode format?
Actually, I know how to define unicode and raw strings in both Python 2 and 3.
I'm just curious about one detail when the interpreter loads source code.
Will it convert the WHOLE raw source code (encoded bytes) to Unicode at the very beginning, and then interpret the Unicode-format source code piece by piece?
Or will it instead load the raw source piece by piece, decoding only what it thinks it should? For example, when it hits the statement u'中文', OK, decode it to unicode; when it hits the statement b'中文', OK, no need to decode.
Which way the interpreter will go?
If your source file is encoded with GBK, put this line at the top of the file (first or second line):
# coding: gbk
This is required for both Python 2 and 3.
If you omit this encoding declaration, the interpreter will assume ASCII in the case of Python 2, and UTF-8 for Python 3.
The encoding declaration controls how the interpreter reads the bytes of the source file. This is mainly relevant for string literals (like in your example), but theoretically also applies to comments and even identifiers (it's probably not a good idea to use non-ASCII in identifiers, though).
As for the question whether you get byte strings or unicode strings: this depends on the syntax, not on the choice and declaration of encoding.
As pointed out in Ignacio's answer, if you want to have unicode strings in Python 2, you need to use the u'...' notation.
In Python 3, the u prefix is optional.
So, with a correct encoding declaration in the file header, it is sufficient to write astring = '中文' to get a correct unicode string in Python 3.
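For instance, a minimal Python 3 sketch, assuming the file really is saved in GBK:
# coding: gbk
# With the declaration above, this literal is decoded from the file's
# GBK bytes into a proper str of two characters
astring = '中文'
print(len(astring))  # 2 -- two characters, not the four bytes GBK uses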
Update
By comment, the OP asks about the interpretation of b'中文'.
In Python 3, this isn't allowed (byte strings can only contain ASCII characters), but you can test this yourself in Python 2.x:
# coding: gbk
x = b'中文'
y = u'中文'
print repr(x)
print repr(y)
This will produce:
'\xd6\xd0\xce\xc4'
u'\u4e2d\u6587'
The first line reflects the actual bytes contained in the source file (if you saved it with GBK, of course).
So there seems to be no decoding happening for b'中文'.
However, I don't know how the interpreter internally represents the source code with respect to encoding (that seems to be your question).
This is implementation-dependent anyway, so the answer might even be different for cPython, Jython, IronPython etc.
So I suspect the Python 2 interpreter does not convert source code to Unicode.
It never does. If you want to use Unicode rather than bytes, then you need to use a unicode literal instead.
astring = u'中文'
Python source is nominally plain ASCII, meaning that the actual encoding does not matter except for literal strings, be they unicode strings or byte strings. Identifiers can use non-ASCII characters (IMHO it would be a very bad practice), but their meaning is normally internal to the Python interpreter, so the way it reads them is not really important.
Byte strings are always left unchanged. That means that normal strings in Python 2 and byte literal strings in Python 3 are never converted.
Unicode strings are always converted:
if the special string coding: charset_name exists in a comment on the first or second line, the original byte string is converted as it would be with decode(charset_name), as sketched below
if no encoding is specified, Python 2 will assume ASCII and Python 3 will assume UTF-8
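A one-line sketch of that rule, assuming Python 2 and a file saved as UTF-8 with a matching declaration:
# coding: utf-8
# The unicode literal is exactly the source bytes run through decode()
assert u'中文' == '中文'.decode('utf-8')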
Quite a silly question, I know. Of course normally LANG=C indicates an ASCII terminal
which cannot display Unicode characters. But I nevertheless want to print out the UTF-8 bytes. I use Python 2 (2.6.5 actually)
print '\xc3\xa4', u'\xe4'
This prints 'ä ä' on a Unicode terminal, but the second string causes an error when executed with LANG=C. I don't want Python to be smart but simply convert u'\xe4' to UTF-8 so it's just '\xc3\xa4' in memory.
I tried all combinations of decode(), encode() and unicode() that I can imagine but it seems I missed the right combination.
What I actually want is to read Unicode characters through vi's system() function, like
:echo system('python foo.py')
To encode a unicode string to UTF-8, call .encode('utf-8') on it:
>>> u'\xe4'.encode('utf-8')
'\xc3\xa4'
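And to make the original print work regardless of the terminal's locale, a minimal Python 2 sketch that writes the encoded bytes directly:
# Bypass print's locale-dependent handling of unicode objects
import sys
sys.stdout.write(u'\xe4'.encode('utf-8') + '\n')  # always writes '\xc3\xa4'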
I wanted to url encode a python string and got exceptions with hebrew strings.
I couldn't fix it and started doing some guess-oriented programming.
Finally, doing mystr = mystr.encode("utf8") before sending it to the url encoder saved the day.
Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (i.e. prefixed by a u).
My original string was a unicode string anyways (i.e. prefixed by a u)
...which is the problem. It wasn't a "string", as such, but a "Unicode object". It contains a sequence of Unicode code points. These code points must, of course, have some internal representation that Python knows about, but whatever that is is abstracted away and they're shown as those \uXXXX entities when you print repr(my_u_str).
To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on the encoding, because there are plenty to choose from. UTF8 and UTF16 are common choices. ASCII could be too, if it fits. u"abc".encode('ascii') works just fine.
Do my_u_str = u"\u2119ython" and then type(my_u_str) and type(my_u_str.encode('utf8')) to see the difference in types: The first is <type 'unicode'> and the second is <type 'str'>. (Under Python 2.5 and 2.6, anyway).
Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.
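A quick sketch of that check at a Python 2 prompt:
# Python 2: a unicode object vs. its UTF-8 encoded byte string
my_u_str = u"\u2119ython"
print type(my_u_str)                 # <type 'unicode'>
print type(my_u_str.encode('utf8'))  # <type 'str'>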
Your original string was a unicode object containing raw Unicode code points; after encoding it as UTF-8, it is a normal byte string that contains UTF-8 encoded data.
The URL encoder seems to expect a byte string, so that it can URL-encode one byte after another and doesn't have to deal with Unicode code points. When you give it a unicode object, it tries to convert it to a byte string using some default encoding, probably ASCII. For Hebrew characters that cannot be represented as ASCII, this will lead to errors.
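A minimal Python 2 sketch of the failure and the fix (the Hebrew sample word is just illustrative):
# urllib.quote wants a byte string; encode unicode input first
import urllib
heb = u'\u05e9\u05dc\u05d5\u05dd'        # the Hebrew word "shalom"
print urllib.quote(heb.encode('utf-8'))  # %D7%A9%D7%9C%D7%95%D7%9D
# urllib.quote(heb) fails: the implicit ASCII conversion cannot
# represent these characters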
What does .encode("utf8") do?
It depends on which version of Python you're using:
In Python 3.x, it converts a str object (a sequence of Unicode code points, whatever the internal representation) into a bytes object containing the UTF-8 representation of the string.
In Python 2.x, it converts a unicode object into a str object encoded in UTF-8. But str has an encode method too, and writing '...'.encode('UTF-8') is equivalent to writing '...'.decode('ascii').encode('UTF-8').
Since you mentioned the "u" prefix, you must be using 2.x. If you don't require any 2.x-only libraries, I'd recommend switching to 3.x, which has a nice clear distinction between text and binary data.
Dive into Python 3 has a good explanation of the issue.
Can somebody explain what happened?
It would help if you told us what the error message was.
The urllib.quote function expects a str object. It also happens to work with unicode objects that contain only ASCII characters, but not when they contain Hebrew letters.
In Python 3.x, urllib.parse.quote accepts both str (=Python 2.x unicode) and bytes objects. Strings are automatically encoded in UTF-8.
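A short Python 3 sketch of that behaviour:
# str input to quote is UTF-8 encoded automatically
from urllib.parse import quote
print(quote('שלום'))  # %D7%A9%D7%9C%D7%95%D7%9D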
"...".encode("utf-8") transforms the in-memory representation of the string into an UTF-8 -encoded string.
url encoder likely expected a bytestring, that is, string representation where each character is represented with a single byte.
It returns a UTF-8 encoded version of the Unicode string, mystr. It is important to realize that UTF-8 is simply one way of encoding Unicode. Python can work with many other encodings (e.g. mystr.encode("utf32") or even mystr.encode("ascii")).
The link that balpha posted explains it all. In short:
The fact that your string was prefixed with "u" just means it's composed of Unicode characters (or code points). UTF-8 is an encoding of this string into a sequence of bytes.