How to convert a string to utf-8 in Python

I have a browser that sends UTF-8 characters to my Python server, but when I retrieve them from the query string, Python returns them as ASCII. How can I convert the plain string to UTF-8?
NOTE: The string passed from the web is already UTF-8 encoded; I just want Python to treat it as UTF-8, not ASCII.

In Python 2
>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)
^ This is the difference between a byte string (plain_string) and a unicode string.
>>> s = "Hello!"
>>> u = unicode(s, "utf-8")
^ Converting to unicode and specifying the encoding.
In Python 3
All strings are unicode. The unicode function does not exist anymore. See answer from @Noumenon
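For Python 3, a minimal sketch of the same conversion might look like the following. The name raw is a hypothetical stand-in for whatever bytes your server actually receives; the point is only that decoding goes from bytes to str and encoding goes back again.

```python
# Python 3: bytes -> str is an explicit decode; str -> bytes is an explicit encode.
raw = b"Caf\xc3\xa9"               # hypothetical UTF-8 bytes as received from a client
text = raw.decode("utf-8")         # a str holding 4 code points
round_trip = text.encode("utf-8")  # back to the original 5 bytes

print(type(raw).__name__, type(text).__name__)
print(len(raw), len(text))
```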

If the methods above don't work, you can also tell Python to ignore portions of a string that it can't convert to utf-8:
stringnamehere.decode('utf-8', 'ignore')
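Be aware that 'ignore' silently drops data. A small Python 3 sketch comparing the error handlers, using a made-up byte string (damaged) whose last byte is not valid UTF-8:

```python
# A byte string whose last byte (0xff) is not valid UTF-8.
damaged = b"Hi \xe2\x9c\x93 there\xff"

ignored = damaged.decode("utf-8", "ignore")    # silently drops the bad byte
replaced = damaged.decode("utf-8", "replace")  # substitutes U+FFFD for it

try:
    damaged.decode("utf-8")                    # the default is errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

If you need to know that something was lost, 'replace' is probably the safer choice of the two, since the replacement character stays visible in the result.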

Might be a bit overkill, but when I work with ascii and unicode in same files, repeating decode can be a pain, this is what I use:
def make_unicode(inp):
    if type(inp) != unicode:
        inp = inp.decode('utf-8')
    return inp
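In Python 3 there is no unicode type, so a rough equivalent would check for bytes instead. The helper name to_text below is made up for illustration:

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it only if it is bytes (hypothetical helper)."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value  # already text, leave it alone


print(to_text(b"\xc3\xb1"), to_text("\u00f1"))
```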

Adding the following line to the top of your .py file:
# -*- coding: utf-8 -*-
allows you to encode strings directly in your script, like this:
utfstr = "ボールト"

If I understand you correctly, you have a utf-8 encoded byte-string in your code.
Converting a byte-string to a unicode string is known as decoding (unicode -> byte-string is encoding).
You do that by using the unicode function or the decode method. Either:
unicodestr = unicode(bytestr, encoding)
unicodestr = unicode(bytestr, "utf-8")
Or:
unicodestr = bytestr.decode(encoding)
unicodestr = bytestr.decode("utf-8")

city = 'Ribeir\xc3\xa3o Preto'
print city.decode('cp1252').encode('utf-8')

Python 3 (including 3.6) has no built-in unicode() function.
Strings are already stored as unicode by default and no conversion is required. Example:
my_str = "\u221a25"
print(my_str)
>>> √25

Translate with ord() and unichr().
Every Unicode character has an associated number, something like an index, and Python has functions to translate between a character and its number. Below is an example with ñ. Hope it helps.
>>> C = 'ñ'
>>> U = C.decode('utf8')
>>> U
u'\xf1'
>>> ord(U)
241
>>> unichr(241)
u'\xf1'
>>> print unichr(241).encode('utf8')
ñ
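In Python 3, unichr() is gone and chr() covers the full Unicode range. A short sketch of the same round trip:

```python
char = "\u00f1"                          # the character ñ
code_point = ord(char)                   # its Unicode code point as an int
same_char = chr(code_point)              # chr() replaces Python 2's unichr()
utf8_bytes = same_char.encode("utf-8")   # two bytes in UTF-8

print(code_point, same_char, utf8_bytes)
```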

First, str in Python is represented in Unicode.
Second, UTF-8 is an encoding standard to encode Unicode string to bytes. There are many encoding standards out there (e.g. UTF-16, ASCII, SHIFT-JIS, etc.).
When the client sends data to your server and they are using UTF-8, they are sending a bunch of bytes not str.
You received a str because the "library" or "framework" that you are using, has implicitly converted some random bytes to str.
Under the hood, there is just a bunch of bytes. You just need to ask the "library" to give you the request content as bytes, and then handle the decoding yourself (if the library can't give you the bytes, it is trying to do black magic and you shouldn't use it).
Decode UTF-8 encoded bytes to str: bs.decode('utf-8')
Encode str to UTF-8 bytes: s.encode('utf-8')

The URL is percent-encoded into ASCII, so to the Python server it is just a plain string, e.g.:
"T%C3%A9st%C3%A3o"
Here "é" and "ã" arrive as the literal text %C3%A9 and %C3%A3.
You can decode a URL like this:
import urllib.parse
url = "T%C3%A9st%C3%A3o"
print(urllib.parse.unquote(url))
>> Téstão
See https://www.adamsmith.haus/python/answers/how-to-decode-a-utf-8-url-in-python for details.
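Since the original question is about a query string, here is a hedged Python 3 sketch using urllib.parse.parse_qs. The query string itself is invented for the example; your framework may already do this step for you.

```python
from urllib.parse import parse_qs, unquote

# Hypothetical raw query string, as it appears after the "?" in a URL.
query = "city=Ribeir%C3%A3o+Preto&q=T%C3%A9st%C3%A3o"

# parse_qs percent-decodes each value and interprets the bytes as UTF-8.
params = parse_qs(query, encoding="utf-8")
print(params["city"][0])

# Decoding the same escapes with the wrong encoding gives mojibake.
wrong = unquote("T%C3%A9st%C3%A3o", encoding="latin-1")
print(wrong)
```

The second call shows what "treated as the wrong encoding" typically looks like: every accented character turns into two unrelated ones.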

you can also do this:
from unidecode import unidecode
unidecode(yourStringtoDecode)

You can use python's standard library codecs module.
import codecs
codecs.decode(b'Decode me', 'utf-8')

Yes, you can add
# -*- coding: utf-8 -*-
in your source code's first line.
You can read more details here https://www.python.org/dev/peps/pep-0263/

Related

How do I convert a string formatted like unicode into unicode?

I have a variable with the string '\u96e8' and I want to convert this to unicode, because the function kanji_to_romaji() only accepts unicode. How would I do this? I am on python 2.7
# -*- coding: UTF-8 -*-
from kanji_to_romaji import kanji_to_romaji
message = '\u96e8'
message = unicode(message)
x = kanji_to_romaji(message)
print(x)
You can decode the bytestring to unicode using the unicode-escape codec.
>>> message = '\u96e8'
>>> unicode_message = message.decode('unicode-escape')
>>> unicode_message
u'\u96e8'
>>> print unicode_message
雨
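The question is about Python 2.7, but for comparison, a Python 3 sketch of the same idea: str has no decode() there, so the text is first turned into bytes. This assumes the string holds only ASCII characters plus the escape.

```python
escaped = "\\u96e8"   # six characters: backslash, u, 9, 6, e, 8

# Encode to ASCII bytes first, then let the unicode_escape codec interpret the escape.
decoded = escaped.encode("ascii").decode("unicode_escape")

print(len(escaped), len(decoded), decoded)
```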
Use ast.literal_eval:
>>> import ast
>>> message = '\u96e8'
>>> ast.literal_eval('u"{}"'.format(message))
u'\u96e8'
The trick is to construct a string containing a unicode string literal to pass as the argument to literal_eval. That is, u"\u96e8" rather than just \u96e8.
(This is only partially correct, though. It will fail if the value of message itself contains a double-quote. There are probably other cases where this fails as well.)

bytes with embedded literal \xhh escapes to unicode

I have: b'{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
I need: '{"street": "Grosskölnstraße"}'
I tried:
s.decode('utf8'): # '{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
s.decode('unicode_escape'): # '{"street":"GrosskölnstraÃ\x9fe"}'
What's the correct way?
That's.. quite a mess you have there. That looks like UTF-8 bytes embedded as Python byte escape sequences.
There is no codec that'll produce bytes as output again; you'll need to decode with the unicode_escape codec, re-encode as Latin-1 to get back to UTF-8 bytes, and then decode as UTF-8:
s.decode('unicode_escape').encode('latin1').decode('utf8')
Demo:
>>> s = b'{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
>>> s.decode('unicode_escape').encode('latin1').decode('utf8')
'{"street":"Grosskölnstraße"}'
Another option is to target just the \x[hexdigits]{2} pattern in a regex; this may be the more robust option if the specific data wasn't produced by a faulty Python script:
import re
from functools import partial
escape = re.compile(rb'\\x([\da-f]{2})')
repair = partial(escape.sub, lambda m: bytes.fromhex(m.group(1).decode()))
repair() returns a bytes object:
>>> repair(s)
b'{"street":"Grossk\xc3\xb6lnstra\xc3\x9fe"}'
>>> repair(s).decode('utf8')
'{"street":"Grosskölnstraße"}'

Correctly decoding hex escaped unicode strings in python [duplicate]

I have a unicode string like '%C3%A7%C3%B6asd+fjkls%25asd' and I want to decode this string.
I used urllib.unquote_plus(str) but it gives the wrong result.
expected : çöasd+fjkls%asd
result : çöasd fjkls%asd
The two-byte UTF-8 characters (%C3%A7 and %C3%B6) are decoded incorrectly.
My python version is 2.7 under a linux distro.
What is the best way to get expected result?
You have 3 or 4 or 5 problems ... but repr() and unicodedata.name() are your friends; they unambiguously show you exactly what you have got, without the confusion engendered by people with different console encodings communicating the results of print fubar.
Summary: either (a) you start with a unicode object and apply the unquote function to that or (b) you start off with a str object and your console encoding is not UTF-8.
If as you say you start off with a unicode object:
>>> s0 = u'%C3%A7%C3%B6asd+fjkls%25asd'
>>> print repr(s0)
u'%C3%A7%C3%B6asd+fjkls%25asd'
this is an accidental nonsense. If you apply urllibX.unquote_YYYY() to it, you get another nonsense unicode object (u'\xc3\xa7\xc3\xb6asd+fjkls%asd') which would cause your shown symptoms when printed. You should convert your original unicode object to a str object immediately:
>>> s1 = s0.encode('ascii')
>>> print repr(s1)
'%C3%A7%C3%B6asd+fjkls%25asd'
then you should unquote it:
>>> import urllib2
>>> s2 = urllib2.unquote(s1)
>>> print repr(s2)
'\xc3\xa7\xc3\xb6asd+fjkls%asd'
Looking at the first 4 bytes of that, it's encoded in UTF-8. If you do print s2, it will look OK if your console is expecting UTF-8, but if it's expecting ISO-8859-1 (aka latin1) you'll see your symptomatic rubbish (first char will be A-tilde). Let's park that thought for a moment and convert it to a Unicode object:
>>> s3 = s2.decode('utf8')
>>> print repr(s3)
u'\xe7\xf6asd+fjkls%asd'
and inspect it to see what we've actually got:
>>> import unicodedata
>>> for c in s3[:6]:
...     print repr(c), unicodedata.name(c)
...
u'\xe7' LATIN SMALL LETTER C WITH CEDILLA
u'\xf6' LATIN SMALL LETTER O WITH DIAERESIS
u'a' LATIN SMALL LETTER A
u's' LATIN SMALL LETTER S
u'd' LATIN SMALL LETTER D
u'+' PLUS SIGN
Looks like what you said you expected. Now we come to the question of displaying it on your console. Note: don't freak out when you see "cp850"; I'm doing this portably and just happen to be doing this in a Command Prompt on Windows.
>>> import sys
>>> sys.stdout.encoding
'cp850'
>>> print s3
çöasd+fjkls%asd
Note: the unicode object was explicitly encoded using sys.stdout.encoding. Fortunately all the unicode characters in s3 are representable in that encoding (and cp1252 and latin1).
Using either unquote or unquote_plus will give you a byte string. If you want a Unicode string then you have to decode the byte string to unicode:
>>> print(urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd').decode('utf8'))
çöasd fjkls%asd
>>>
Compared with:
>>> print(urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd'))
çöasd fjkls%asd
>>>
Note that your input string must be a byte string: if you pass unicode to unquote/unquote_plus then you'll get a bit of a mess. If this is the case then encode it first:
>>> print(urllib.unquote_plus(u'%C3%A7%C3%B6asd+fjkls%25asd'.encode('ascii')).decode('utf8'))
çöasd fjkls%asd
Try urllib2 once more:
print urllib2.unquote('%C3%A7%C3%B6asd+fjkls%25asd')
'%C3%A7%C3%B6asd+fjkls%25asd' - this is not a unicode string.
This is a url-encoded string. Use urllib2.unquote() instead.
You have a double problem: your string is a unicode object and it contains URL-encoded characters. You can convert the string to ASCII first to be sure it won't be interpreted incorrectly:
>>> s = '%C3%A7%C3%B6asd+fjkls%25asd' # ascii string
>>> print urllib2.unquote(s) # works as expected
çöasd+fjkls%asd
>>> s = u'%C3%A7%C3%B6asd+fjkls%25asd' # unicode string
>>> print urllib2.unquote(s) # decode stuff that it shouldn't
çöasd+fjkls%asd
>>> print urllib2.unquote(s.encode('ascii')) # encode the unicode string to ascii: works!
çöasd+fjkls%asd
You are using the unquote_plus method, which also converts + to a space. Just use the unquote method and you should be fine.
>>> import urllib
>>> print urllib.unquote('%C3%A7%C3%B6asd+fjkls%25asd')
çöasd+fjkls%asd
>>> print urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd')
çöasd fjkls%asd
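For readers on Python 3, where these functions live in urllib.parse and return str directly, a small sketch of the same difference:

```python
from urllib.parse import unquote, unquote_plus

quoted = "%C3%A7%C3%B6asd+fjkls%25asd"

keeps_plus = unquote(quoted)          # percent-escapes decoded as UTF-8, "+" untouched
plus_to_space = unquote_plus(quoted)  # same, but "+" also becomes a space

print(keeps_plus)
print(plus_to_space)
```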

Python UnicodeDecodeError when writing German letters

I've been banging my head on this error for some time now and I can't seem to find a solution anywhere on SO, even though there are similar questions.
Here's my code:
f = codecs.open(path, "a", encoding="utf-8")
value = "Bitte überprüfen"
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
And what I get as an error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)
Why ascii if I say utf-8? I would really appreciate any help.
Try:
value = u"Bitte überprüfen"
in order to declare value as a unicode string and
# -*- coding: utf-8 -*-
at the start of your file in order to declare that your python file is saved with utf-8 encoding.
For the sake of never being hurt by unicode errors ever again, switch to python3:
% python3
>>> with open('/tmp/foo', 'w') as f:
...     value = "Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value)))
...
36
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo
"no_internet" = "Bitte überprüfen";
though if you're really tied to python2 and have no choice:
% python2
>>> with open('/tmp/foo2', 'w') as f:
...     value = u"Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
...
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
And as @JuniorCompressor suggests, don't forget to add # encoding: utf-8 at the start of your python2 file to tell python to read the source file as UTF-8, not ASCII!
Your error in:
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
is that you're encoding the whole formatted string into utf-8, whereas you should encode the value string into utf-8 before doing the format. Encoding after formatting fails:
>>> with open('/tmp/foo2', 'w') as f:
...     value = u"Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value).encode('utf-8')))
...
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)
This fails because str.format() on a byte-string template has to turn the unicode value into a byte string first, and Python 2 does that implicitly with the ASCII codec, before the trailing .encode('utf-8') ever runs. So keep value as a unicode object (which is what u"" gives you) and explicitly encode it to UTF-8 before feeding it to format, to build the new string.
As Karl says in his answer, Python2 is totally messy/buggy when using unicode strings, defeating the Explicit is better than implicit zen of python. And for more weird behaviour, the following works just fine in python2:
>>> value = "Bitte überprüfen"
>>> out = '"{}" = "{}";\n'.format('no_internet', value)
>>> out
'"no_internet" = "Bitte \xc3\xbcberpr\xc3\xbcfen";\n'
>>> print(out)
"no_internet" = "Bitte überprüfen";
Still not convinced to switch to python3 ? :-)
Update:
This is the way to go to read and write an unicode string from a file to another file:
% echo "Bitte überprüfen" > /tmp/foobar
% python2
>>> with open('/tmp/foobar', 'r') as f:
...     data = f.read().decode('utf-8').strip()
...
>>>
>>> with open('/tmp/foo2', 'w') as f:
...     f.write(('"{}" = "{}";\n'.format('no_internet', data.encode('utf-8'))))
...
>>> import sys;sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
Update:
As a general rule:
when you get a UnicodeDecodeError, you should use .decode('utf-8') on the byte string that contains the encoded data, and
when you get a UnicodeEncodeError, you should use .encode('utf-8') on the unicode string that contains the data.
Update: if you cannot update to python3, you can at least make your python2 behave almost like python3, using the following __future__ import statement:
from __future__ import absolute_import, division, print_function, unicode_literals
HTH
Like already suggested your error results from this line:
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
it should be:
f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
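Note that the question opened the file with codecs.open(..., encoding="utf-8"), which already encodes on write, so an alternative is to hand it text and not call .encode() at all. A Python 3 sketch of that idea with the built-in open(); the temporary path is only a placeholder for this example:

```python
import os
import tempfile

value = "Bitte \u00fcberpr\u00fcfen"   # "Bitte überprüfen"
line = '"{}" = "{}";\n'.format("no_internet", value)

# Placeholder output path for the example.
path = os.path.join(tempfile.mkdtemp(), "strings.txt")

with open(path, "a", encoding="utf-8") as f:
    f.write(line)   # pass text; the file object does the encoding

with open(path, "rb") as f:
    raw = f.read()  # read back the raw bytes to see what was stored

print(raw)
```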
A note on unicode and encodings
If working with Python 2, software should only work with unicode strings internally, converting to a particular encoding on output.
To avoid making the same error over and over again, you should make sure you understand the difference between the ascii and utf-8 encodings, and also between str and unicode objects in Python.
The difference between ASCII and UTF-8 encoding:
Ascii needs just one byte to represent all possible characters in the ascii charset/encoding. UTF-8 needs up to four bytes to represent the complete charset.
ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
utf-8 (unicode transformation format)
1 If the code point is <128, it’s represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3 Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
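The byte counts above can be checked with a few sample characters. A Python 3 sketch (the sample characters are my own choice, one per length class):

```python
# One sample per UTF-8 length class: U+0041, U+00FC, U+96E8, U+1F600.
samples = ["A", "\u00fc", "\u96e8", "\U0001F600"]

utf8_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    utf8_lengths.append(len(encoded))
    print("U+{:04X} -> {} byte(s): {!r}".format(ord(ch), len(encoded), encoded))

# ASCII cannot represent code points of 128 or greater.
try:
    "\u00fc".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```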
The difference between str and unicode objects:
You can say that str is basically a byte string and unicode is a unicode string. A str holds bytes in some encoding like ascii or utf-8; a unicode object holds code points and only gets an encoding when you encode it.
str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 basestring
      /\
     /  \
  str    unicode
If you follow some simple rules you should go fine with handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:
Rules
1 encode(): Gets you from Unicode -> bytes
encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2 decode(): Gets you from bytes -> Unicode
decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3 codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4 u"": Makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])
Warning: Don’t use encode() on bytes or decode() on Unicode objects
And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.
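A tiny Python 3 sketch of "Unicode internally, encode on output"; the input bytes are invented for the example. It also shows why working on text matters: the same operation on raw bytes only touches the ASCII letters.

```python
incoming = b"stra\xc3\x9fe"          # hypothetical UTF-8 input ("straße")

text = incoming.decode("utf-8")      # 1. decode once, at the boundary
processed = text.upper()             # 2. work on text: ß uppercases to SS
outgoing = processed.encode("utf-8") # 3. encode once, on output

bytes_upper = incoming.upper()       # operating on bytes leaves the ß bytes as they were

print(processed, outgoing, bytes_upper)
```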
Why ascii if I say utf-8?
Because in Python 2, "Bitte überprüfen" is not a Unicode string. Before it can be .encoded by your explicit call, Python must implicitly decode it to Unicode (This is also why it raises a UnicodeDecodeError), and it chooses ASCII because it has no other information to work with. The ü is represented with some byte with value >= 128, so it's not valid ASCII.
The u prefix shown by @JuniorCompressor will make it a Unicode string, and you should specify the encoding for the file as well (don't just blindly set utf-8; it needs to match whatever your text editor saves the .py file with!).
Switching to Python 3 is realistically (part of) a better long-term solution :) but it is still essential to understand the problem. See http://bit.ly/unipain for more details. The Python 2 behaviour is really a bug, or at least a failure to meet Pythonic design principles: Explicit is better than implicit, and here we see why very clearly ;)
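To illustrate the "explicit" behaviour in Python 3, here is a short sketch: mixing text and bytes raises a TypeError instead of triggering a hidden ASCII decode, and bytes objects have no encode() method to call by mistake.

```python
text = "Bitte "
encoded = "\u00fcberpr\u00fcfen".encode("utf-8")   # bytes

try:
    text + encoded            # str + bytes: no implicit conversion in Python 3
    mixed_ok = True
except TypeError:
    mixed_ok = False

bytes_have_encode = hasattr(encoded, "encode")     # bytes cannot be "encoded" again

print(mixed_ok, bytes_have_encode)
```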

Python's string.maketrans works at home but fails on Google App Engine

I have this code in Google AppEngine (Python SDK):
from string import maketrans
intab = u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ".encode('latin1')
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn".encode('latin1')
logging.info(len(intab))
logging.info(len(outtab))
trantab = maketrans(intab, outtab)
When I run the code in the interactive console I have no problem, but when I try it in GAE I get the following error:
raise ValueError, "maketrans arguments must have same length"
ValueError: maketrans arguments must have same length
INFO 2009-12-03 20:04:02,904 dev_appserver.py:3038] "POST /backendsavenew HTTP/1.1" 500 -
INFO 2009-12-03 20:08:37,649 admin.py:112] 106
INFO 2009-12-03 20:08:37,651 admin.py:113] 53
ERROR 2009-12-03 20:08:37,653 init.py:388] maketrans arguments must have same length
I can't figure out why intab is doubled in size.
The python file with the code is saved as UTF-8.
Thanks in advance for any help.
string.maketrans and string.translate do not work for Unicode strings. Your call to string.maketrans will implicitly convert the Unicode you gave it to an encoding like utf-8. In utf-8 å takes up more space than ASCII a. string.maketrans sees len(str(argument)) which is different for your two strings.
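The 106 versus 53 in the log is consistent with every accented character taking two bytes in UTF-8. A quick sketch (Python 3 syntax, with a shortened version of the table) that shows the same doubling:

```python
# A shortened version of the table from the question: 12 accented characters.
intab = "\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5\u00e0\u00e1\u00e2\u00e3\u00e4\u00e5"

chars = len(intab)                          # number of characters
latin1_bytes = len(intab.encode("latin-1")) # one byte per character
utf8_bytes = len(intab.encode("utf-8"))     # two bytes per character

print(chars, latin1_bytes, utf8_bytes)
```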
There is a Unicode translate, but for your use case (convert Unicode to ASCII because some part of your system cannot deal with Unicode) you should use http://pypi.python.org/pypi/Unidecode. Unidecode is very smart about transliterating Unicode characters to sensible ASCII, covering many more characters than in your example.
You should save your Python code as utf-8, but make sure you add the magic so Python doesn't have to assume you used the system's default encoding. This line should be the first or second line of your Python files:
# -*- coding: utf-8 -*-
There are many advantages to processing text as Unicode instead of binary strings. This is the Unicode way to do what you are trying to do:
intab = u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"
trantab = dict((ord(a), b) for a, b in zip(intab, outtab))
translated = intab.translate(trantab)
translated == outtab # True
See also Where is Python's "best ASCII for this Unicode" database?
See also How do I get str.translate to work with Unicode strings?
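On Python 3 the dict-based approach above is built in as str.maketrans. A minimal sketch with a shortened table (the sample word is made up):

```python
# Shortened tables; both strings must have the same length.
intab = "\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5\u00e0\u00e1\u00e2\u00e3\u00e4\u00e5\u00c7\u00e7\u00d1\u00f1"
outtab = "aaaaaaaaaaaaccnn"

table = str.maketrans(intab, outtab)   # maps code points to replacement characters
result = "\u00c0\u00d1o \u00e7\u00e0".translate(table)   # "ÀÑo çà"

print(result)
```

Characters that are not in the table pass through unchanged, so a hand-written table like this only covers what you list; that is the gap Unidecode is meant to fill.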
Maybe you could use iso-8859-1 encoding for your file instead of utf-8
# -*- coding: iso-8859-1 -*-
from string import maketrans
import logging
intab = "ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = "aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"
logging.info(len(intab))
logging.info(len(outtab))
trantab = maketrans(intab, outtab)
Remember to select iso-8859-1 in your text editor while saving this python source file.
