(unicode error) 'unicodeescape' codec can't decode bytes - string with '\u' - python

Writing my code for Python 2.6, but with Python 3 in mind, I thought it was a good idea to put
from __future__ import unicode_literals
at the top of some modules. In other words, I am asking for trouble now (to avoid it in the future), but I might be missing some important knowledge here. I want to be able to pass a string representing a filepath and instantiate an object as simply as
MyObject('H:\unittests')
In Python 2.6, this works just fine, no need to use double backslashes or a raw string, even for a directory starting with '\u..', which is exactly what I want. In the __init__ method I make sure all single \ occurrences are interpreted as '\\', including those before special characters as in \a, \b, \f, \n, \r, \t and \v (only \x remains a problem). Also, decoding the given string into unicode using the (local) encoding works as expected.
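A minimal sketch of just that escape-repair step (the mapping and helper name are only illustrative, not my actual code, which also handles the decoding step):

_UNDO_ESCAPES = {'\a': '\\a', '\b': '\\b', '\f': '\\f', '\n': '\\n',
                 '\r': '\\r', '\t': '\\t', '\v': '\\v'}

def _fix_backslashes(path):
    # turn accidentally interpreted control characters back into a
    # literal backslash plus letter, e.g. a tab becomes '\\t' again
    for control, literal in _UNDO_ESCAPES.items():
        path = path.replace(control, literal)
    return path

print(_fix_backslashes('H:\tests'))  # prints H:\tests, the tab is repaired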
Preparing for Python 3.x, simulating my actual problem in an editor (starting with a clean console in Python 2.6), the following happens:
>>> '\u'
'\\u'
>>> r'\u'
'\\u'
(OK until here: '\u' is encoded by the console using the local encoding)
>>> from __future__ import unicode_literals
>>> '\u'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence
In other words, the (unicode) string is not interpreted as unicode at all, nor does it get decoded automatically with the local encoding. Even so for a raw string:
>>> r'\u'
SyntaxError: (unicode error) 'rawunicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX
same for u'\u':
>>> u'\u'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence
Also, I would expect isinstance(str(''), unicode) to return True (which it does not), because importing unicode_literals should make all string types unicode. (edit:) Because in Python 3 all strings are sequences of Unicode characters, I would expect str('') to return such a unicode string, and type(str('')) to be both <type 'unicode'> and <type 'str'> (because all strings are unicode), but I also realise that <type 'unicode'> is not <type 'str'>. Confusion all around...
Questions
how can I best pass strings containing '\u'? (without writing '\\u')
does from __future__ import unicode_literals really implement all Python 3-related unicode changes, so that I get a complete Python 3 string environment?
edit:
In Python 3, <type 'str'> is a Unicode object and <type 'unicode'> simply does not exist. In my case I want to write code for Python 2(.6) that will work in Python 3. But when I import unicode_literals, I cannot check if a string is of <type 'unicode'> because:
I assume unicode is not part of the namespace
if unicode is part of the namespace, a literal of <type 'str'> is still unicode when it is created in the same module
type(mystring) will always return <type 'str'> for unicode literals in Python 3
My modules used to be encoded as 'utf-8' by a # coding: UTF-8 comment at the top, while my locale.getdefaultlocale()[1] returns 'cp1252'. So if I call MyObject('çça') from my console, it is encoded as 'cp1252' in Python 2, and as 'utf-8' when calling MyObject('çça') from the module. In Python 3, it will not be encoded at all, but will be a unicode literal.
edit:
I gave up hope of being allowed to avoid using '\\' before a u (or x, for that matter). Also, I understand the limitations of importing unicode_literals. However, the many possible combinations of passing a string from a module to the console and vice versa, each with a different encoding, and on top of that importing unicode_literals or not and Python 2 vs Python 3, made me want to create an overview by actual testing. Hence the table below.
In other words, type(str('')) does not return <type 'str'> in Python 3 but <class 'str'>, and all of the Python 2 problems seem to be avoided.

AFAIK, all that from __future__ import unicode_literals does is to make all string literals of unicode type, instead of string type. That is:
>>> type('')
<type 'str'>
>>> from __future__ import unicode_literals
>>> type('')
<type 'unicode'>
But str and unicode are still different types, and they behave just like before.
>>> type(str(''))
<type 'str'>
str('') is always of type str.
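If you need an isinstance() check that behaves the same on Python 2 and 3, a version-aware alias is one way to do it (a sketch only; text_type is just a local name, not something from the standard library):

import sys

if sys.version_info[0] >= 3:
    text_type = str        # every Python 3 string is Unicode
else:
    text_type = unicode    # Python 2 only; the name does not exist on Python 3

print(isinstance('', text_type))  # True under unicode_literals on 2, and always on 3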
About your r'\u' issue, it is by design, as it is equivalent to ur'\u' without unicode_literals. From the docs:
When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U' prefix, then the \uXXXX and \UXXXXXXXX escape sequences are processed while all other backslashes are left in the string.
This probably comes from the way the lexical analyzer worked in the Python 2 series. In Python 3 it works as you (and I) would expect.
You can type the backslash twice, and then the \u will not be interpreted, but you'll get two backslashes!
Backslashes can be escaped with a preceding backslash; however, both remain in the string
>>> ur'\\u'
u'\\\\u'
So IMHO, you have two simple options:
Do not use raw strings, and escape your backslashes (compatible with python3):
'H:\\unittests'
Be too smart and take advantage of unicode codepoints (not compatible with python3):
r'H:\u005cunittests'

For me this issue was related to a package that was not up to date, in this case numpy.
To fix:
conda install -f numpy

I tried this on Python 3:
import os
os.path.abspath("yourPath")
and it worked!

When you're writing string literals which contain backslashes, such as paths (on Windows) or regexes, use raw strings. That's what they're for.
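For instance, on Python 3 (or Python 2 without unicode_literals, since the rawunicodeescape problem above does not apply there):

path = r'H:\unittests'   # backslash kept literally, no doubling needed
pattern = r'\b\w+\.txt'  # regexes are the other classic use of raw strings
print(path)              # H:\unittests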

Related

Use unicode strings instead of regular strings? (Python 2.7)

As far as I know there is a difference between strings and unicode strings in Python. But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?
So when I get a text input, I don't need to use unicode()?
I might sound lazy but I am just interested if this is possible...
p.s. I don't know a lot about character encoding so please correct me if I got anything wrong
For example (in the Python interactive interpreter; the output differs in a GUI shell):
>>> s = '你好'
>>> s
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> us = u'你好'
>>> us
u'\u4f60\u597d'
>>> print type(s)
<type 'str'>
>>> print type(us)
<type 'unicode'>
>>> len(s)
6
>>> len(us)
2
In short:
First, a str object is a sequence of bytes, while a Unicode string is a sequence of code points (Unicode code units), which are numbers from 0 to 0x10FFFF.
Then, len(string) returns the number of bytes, while len(unicode) returns the number of characters. A Unicode string needs to be represented as a sequence of bytes (meaning, values from 0-255) in memory; the rules for translating a Unicode string into a sequence of bytes are called an encoding.
I think you should use raw_input instead of input if you want to get a bytestring.
But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?
There are two type of strings in Python (on both Python 2 and 3): a bytestring (a sequence of bytes) and a Unicode string (a sequence of Unicode codepoints).
bytestring = b'abc'
unicode_text = u'abc'
The type of string created using the 'abc' string literal depends on the Python version and the presence of the from __future__ import unicode_literals import. Without the import, on Python 2 the 'abc' literal creates a bytestring; otherwise it creates a Unicode string.
Add the encoding declaration at the top of your Python source file if you use non-ascii characters in string literals e.g.: # -*- coding: utf-8 -*-.
So when I get a text input, I don't need to use unicode()?
If by "text input" you mean that your program receives bytes somehow (from a file, network, from the command-line) then no: you shouldn't rely on Python to convert bytes to Unicode implicitly -- you should do it explicitly as soon as you receive the bytes using unicode_text = bytestring.decode(character_encoding).
And in reverse, keep the text as Unicode inside your program. Convert Unicode strings to bytes as late as possible when it is necessary (e.g., to send the text via the network).
Use bytestrings to work with a binary data: an image, a compressed content, etc. Use Unicode strings to work with text in Python.
To read Unicode from a file, use io.open() (you have to know the correct character encoding if it is not locale.getpreferredencoding(False)).
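A short sketch (the filename and the utf-8 encoding are assumptions about your data):

import io

with io.open('notes.txt', encoding='utf-8') as f:  # same call on Python 2 and 3
    text = f.read()  # unicode on Python 2, str on Python 3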
What character encoding to use when you receive your Unicode text via the network may depend on the corresponding protocol, e.g., the charset can be specified in the Content-Type HTTP header:
text = data.decode(response.headers.getparam('charset'))
You could use universal_newlines=True or io.TextIOWrapper() to get Unicode text from an external process started using subprocess module. It can be non-trivial to figure out what character encoding should be used on Windows (if you read Russian, see the gory details here: Byte при печати вывода внешней команды).
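For example, on Python 3 (an illustrative command; universal_newlines=True makes check_output return decoded text instead of bytes, using locale.getpreferredencoding(False)):

import subprocess

version = subprocess.check_output(['python', '--version'],
                                  stderr=subprocess.STDOUT,
                                  universal_newlines=True)
print(type(version))  # <class 'str'>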
In Python 2.6+ you can use from __future__ import unicode_literals, but that only makes string literals Unicode. All functions that returned byte strings still return byte strings.
Example:
>>> s = 'abc'
>>> type(s)
<type 'str'>
>>> from __future__ import unicode_literals
>>> s = 'abc'
>>> type(s)
<type 'unicode'>
For the behavior you want, use Python 3.

Python UnicodeDecodeError when writing German letters

I've been banging my head on this error for some time now and I can't seem to find a solution anywhere on SO, even though there are similar questions.
Here's my code:
f = codecs.open(path, "a", encoding="utf-8")
value = "Bitte überprüfen"
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
And what I get as en error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)
Why ascii if I say utf-8? I would really appreciate any help.
Try:
value = u"Bitte überprüfen"
in order to declare value as a unicode string and
# -*- coding: utf-8 -*-
at the start of your file in order to declare that your python file is saved with utf-8 encoding.
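Putting both suggestions together (a sketch; the path value is only a placeholder for whatever file you are appending to):

import codecs

# assumes the real source file starts with: # -*- coding: utf-8 -*-
path = "out.strings"  # placeholder
value = u"Bitte überprüfen"  # unicode literal, not a byte string
f = codecs.open(path, "a", encoding="utf-8")
f.write(u'"%s" = "%s";\n' % (u"no_internet", value))  # pass unicode; codecs does the UTF-8 encoding
f.close()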
For the sake of never being hurt by unicode errors ever again, switch to python3:
% python3
>>> with open('/tmp/foo', 'w') as f:
... value = "Bitte überprüfen"
... f.write(('"{}" = "{}";\n'.format('no_internet', value)))
...
36
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo
"no_internet" = "Bitte überprüfen";
though if you're really tied to python2 and have no choice:
% python2
>>> with open('/tmp/foo2', 'w') as f:
... value = u"Bitte überprüfen"
... f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
...
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
And as @JuniorCompressor suggests, don't forget to add # encoding: utf-8 at the start of your Python 2 file to tell Python to read the source file as UTF-8, not as ASCII!
Your error in:
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
is that you're encoding the whole formatted string to utf-8, whereas you should encode the value string to utf-8 before doing the format:
>>> with open('/tmp/foo2', 'w') as f:
... value = u"Bitte überprüfen"
... f.write(('"{}" = "{}";\n'.format('no_internet', value).encode('utf-8')))
...
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)
That is because the format string is a byte string, so Python tries to encode the unicode value with the default ASCII codec while building the new string. So you have to use the unicode type (which is what u"" does) and explicitly encode that value before feeding it to the format parser.
As Karl says in his answer, Python 2 is totally messy/buggy when using unicode strings, defeating the "Explicit is better than implicit" Zen of Python. And for more weird behaviour, the following works just fine in Python 2:
>>> value = "Bitte überprüfen"
>>> out = '"{}" = "{}";\n'.format('no_internet', value)
>>> out
'"no_internet" = "Bitte \xc3\xbcberpr\xc3\xbcfen";\n'
>>> print(out)
"no_internet" = "Bitte überprüfen";
Still not convinced to switch to python3 ? :-)
Update:
This is the way to go to read and write a Unicode string from one file to another file:
% echo "Bitte überprüfen" > /tmp/foobar
% python2
>>> with open('/tmp/foobar', 'r') as f:
... data = f.read().decode('utf-8').strip()
...
>>>
>>> with open('/tmp/foo2', 'w') as f:
... f.write(('"{}" = "{}";\n'.format('no_internet', data.encode('utf-8'))))
...
>>> import sys;sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
Update:
as a general rule:
when you get a UnicodeDecodeError, call .decode('utf-8') on the byte string that holds the encoded data, and
when you get a UnicodeEncodeError, call .encode('utf-8') on the unicode string before passing it to anything that expects bytes
Update: if you cannot update to Python 3, you can at least make your Python 2 behave almost like Python 3, using the following __future__ import statement:
from __future__ import absolute_import, division, print_function, unicode_literals
HTH
As already suggested, your error results from this line:
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
it should be:
f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
A note on unicode and encodings
If working with Python 2, software should only work with unicode strings internally, converting to a particular encoding on output.
To prevent making the same error over and over again, you should make sure you understand the difference between the ASCII and UTF-8 encodings, and also between str and unicode objects in Python.
The difference between ASCII and UTF-8 encoding:
ASCII needs just one byte to represent all possible characters in the ASCII charset/encoding. UTF-8 needs up to four bytes per character to represent the complete Unicode charset.
ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
utf-8 (unicode transformation format)
1 If the code point is <128, it’s represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3 Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
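A quick interactive check of those byte counts (Python 2 syntax; the sample characters are arbitrary):

>>> len(u'a'.encode('utf-8'))   # code point < 128
1
>>> len(u'é'.encode('utf-8'))   # code point between 128 and 0x7ff
2
>>> len(u'€'.encode('utf-8'))   # code point > 0x7ff
3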
The difference between str and unicode objects:
You can say that str is basically a byte string and unicode is a Unicode string. A byte string can be in a particular encoding like ascii or utf-8.
str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 basestring
        /    \
      str   unicode
If you follow some simple rules you should be fine handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:
Rules
1 encode(): Gets you from Unicode -> bytes
encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2 decode(): Gets you from bytes -> Unicode
decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3 codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is the most common).
4 u'...': Makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])
Warning: Don’t use encode() on bytes or decode() on Unicode objects
And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.
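A round trip illustrating rules 1 and 2 (Python 2; that the incoming bytes are UTF-8 is an assumption about the data source):

>>> raw = '\xc3\xbc'            # bytes received from outside, UTF-8 encoded
>>> text = raw.decode('utf-8')  # rule 2: bytes -> Unicode on the way in
>>> text
u'\xfc'
>>> text.encode('utf-8')        # rule 1: Unicode -> bytes on the way out
'\xc3\xbc'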
Why ascii if I say utf-8?
Because in Python 2, "Bitte überprüfen" is not a Unicode string. Before it can be .encoded by your explicit call, Python must implicitly decode it to Unicode (This is also why it raises a UnicodeDecodeError), and it chooses ASCII because it has no other information to work with. The ü is represented with some byte with value >= 128, so it's not valid ASCII.
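A minimal Python 2 session showing that implicit decode (assuming the terminal/source is UTF-8, so ü arrives as the two bytes \xc3\xbc):

>>> "Bitte überprüfen".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)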
The u prefix shown by @JuniorCompressor will make it a Unicode string, and you should specify the encoding for the file as well (don't just blindly set utf-8; it needs to match whatever encoding your text editor saves the .py file with!).
Switching to Python 3 is realistically (part of) a better long-term solution :) but it is still essential to understand the problem. See http://bit.ly/unipain for more details. The Python 2 behaviour is really a bug, or at least a failure to meet Pythonic design principles: Explicit is better than implicit, and here we see why very clearly ;)

Converting utf-8 encoded string to just plain text in python 3

So I've been getting all caught up in unicode and utf-8 as i have a script which grabs images and their titles off the web. Works great, except when their title has special characters (eg. Jökulsárlón.)
it comes out as unicode :-
J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n
So I want a way to turn that string into plain text, whether that means turning the characters into the nearest 'normal' letters (like a plain o instead of ö) or printing the actual symbols (rather than \xc3 etc.). I've tried a billion different ways, but a lot of the things I've been reading haven't worked for me in Python 3.
Thanks in advance
It's indeed UTF-8 but they're bytes:
>>> b = b'J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n'
>>> b
b'J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n'
>>> b.decode('utf-8')
'Jökulsárlón'
As this is Python 3.x, this is a Unicode string.
J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n is not unicode. It may be UTF-8 though.
To turn them into Unicode you have to decode them. s.decode('utf-8') if it were UTF-8, for example.
Before printing or writing you have to encode them again. If you encode to ASCII, the encode method accepts an option that tells it what to do with code points that cannot be represented in the given encoding.
For example: print(s.encode('ascii', errors='ignore'))
errors accepts more options.
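For example, on Python 3 (which option fits best depends on how lossy your output is allowed to be):

>>> s = 'Jökulsárlón'
>>> s.encode('ascii', errors='ignore')
b'Jkulsrln'
>>> s.encode('ascii', errors='replace')
b'J?kuls?rl?n'
>>> s.encode('ascii', errors='xmlcharrefreplace')
b'J&#246;kuls&#225;rl&#243;n'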
If your string is <class 'str'> and it prints literally J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n, then the last line below will decode it:
>>> s='J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n'
>>> type(s)
<class 'str'>
>>> s
'J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n'
>>> s.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
'Jökulsárlón'
How it got that convoluted is unknown. If this isn't the solution, then update your question with the type of the variable holding the string (type(s) for example) and the exact value as shown above for my example.

List of unicode strings

If I have a list of unicode strings
lst = [ u"aaa", u"bbb", u"foo", u"bar", ... u"baz", u"zzz" ]
is it necessary to write a prefix u before every string? Can I make a construction that says that every element of lst will be unicode string and then write it without u prefix?
In Python 2.7 (also Python 2.6) you can make unicode literals the default for a module:
from __future__ import unicode_literals
You must include the import at the top of the file, and it then applies to all string literals in the file. Use a b prefix to force byte strings:
>>> from __future__ import unicode_literals
>>> "sss"
u'sss'
>>> b"x"
'x'
If your intention is to convert a set of standard strings to unicode, you could map the unicode() function onto your list:
lst = ["aaa", "bbb", "ccc"]
map(unicode, lst)
Which gives
[u"aaa", u"bbb", u"ccc"]
If, however, lst contains a non-ASCII string, you'll have to prefix that particular string with u. If you don't, you'll get this error on the conversion:
lst = ["\xe4"]
map(unicode,lst)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
As noted in the comments, this answer is different for Python 2.x or 3.x. In Python 3, everything changes:
Everything you thought you knew about binary data and Unicode has changed. Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however encoded Unicode is represented as binary data. The type used to hold text is str, the type used to hold data is bytes. The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.
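A quick Python 3 illustration of the TypeError described in that quote (the exact message wording varies a little between 3.x releases):

>>> 'abc' + b'def'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str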

noob queries on unicode and str methods in Python

I wish to seek some clarifications on Unicode and str methods in Python. After reading some explanation on Unicode, there are still couple of doubts I hope folks can help me on:
Am I right to say that when declaring a unicode string, e.g. word = u'foo', Python uses the encoding of the terminal to decode foo from, e.g., UTF-8, and assigns word the corresponding Unicode representation?
So, in general, is the process of printing out characters from a file always one of decoding the byte stream, according to its encoding, into a Unicode representation before displaying the mapped characters?
In my terminal, why does 'é'.lower() or str('é') display as the hex '\xc3\xa9', whereas 'a'.lower() does not?
First we should be clear we are talking about Python 2 only. Python 3 is different.
You're right. But if you write u"abcd" in a .py file, the declaration of the encoding of the source file will determine how the interpreter decodes your string.
You need to decode it first, and then encode it and print. In Python 2, DON'T print out unicode directly! Otherwise, if the system is encoding it in an incompatible way (like "ascii"), an exception will be raised.
You have to do all these explicitly.
The short answer is that "a" doesn't have to be represented as "\x61"; "a" is simply more readable. A longer answer: typically in the interactive shell, if you type a value and press enter, Python will show the repr() of your string. I think repr will try to print everything in an ASCII representation. For "a", it's already ASCII, so it's output directly. For the str "é", it's a UTF-8 encoded binary stream, so Python escapes each byte and prints it as '\xc3\xa9'.
I don't think Python does any automatic encoding or decoding on console I/O. Consider the following:
>>> 'é'
'\xc3\xa9'
>>> 'é'.decode('UTF-8')
u'\xe9'
You'll notice that \xe9 is the Unicode code point for 'LATIN SMALL LETTER E WITH ACUTE', while \xc3\xa9 is the byte sequence corresponding to the same character in UTF-8.
Everything changes in Python 3, since all strings are Unicode. I'm not sure of the rules there.
See http://www.python.org/dev/peps/pep-0263/ about how to specify encoding of Python source file. For Python interpreter there's PYTHONIOENCODING environment variable.
What OS do you use?
The statement word = u'foo' assigns a unicode string object, not a "hex representation". Unicode objects represent sequences of text characters. Also, it is wrong to think of decoding in this context. Unicode is not an encoding, nor does it "have" an encoding.
Yes. Decode In: Encode Out.
For the repr of a non-unicode string literal, Python will use sys.stdin.encoding; for the repr of a unicode string literal, Python will use "unicode_escape".
