How to encode strings using Python - python

I have lists, each containing an element like:
[u'\xd0\xbc\xd1\x82\xd1\x81 \xd0\xbe\xd1\x84\xd0\xb8\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd0\xb9 \xd1\x81\xd0\xb0\xd0\xb9\xd1\x82']
[u'\xd0\xbc\xd1\x82\xd1\x81 \xd0\xbe\xd1\x84\xd0\xb8\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd0\xb9 \xd1\x81\xd0\xb0\xd0\xb9\xd1\x82']
I tried to convert it using
val[0].encode('utf-8')
And after that I got
мÑÑ Ð¾ÑиÑиалÑнÑй ÑайÑ
мÑÑ Ð¾ÑиÑиалÑнÑй ÑайÑ
What am I doing wrong?

You have a Mojibake: text decoded using the wrong codec.
It looks like the text was decoded as Latin-1 or Windows codepage 1252, when it should have been decoded as UTF-8 instead.
Either reverse the encoding manually, or use the excellent ftfy package to do it for you:
>>> import ftfy
>>> data = [u'\xd0\xbc\xd1\x82\xd1\x81 \xd0\xbe\xd1\x84\xd0\xb8\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd0\xb9 \xd1\x81\xd0\xb0\xd0\xb9\xd1\x82']
>>> ftfy.ftfy(data[0])
u'\u043c\u0442\u0441 \u043e\u0444\u0438\u0446\u0438\u0430\u043b\u044c\u043d\u044b\u0439 \u0441\u0430\u0439\u0442'
>>> print ftfy.ftfy(data[0])
мтс официальный сайт
Manually, you'd re-encode as Latin-1:
>>> data[0].encode('latin1')
'\xd0\xbc\xd1\x82\xd1\x81 \xd0\xbe\xd1\x84\xd0\xb8\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd0\xb9 \xd1\x81\xd0\xb0\xd0\xb9\xd1\x82'
>>> data[0].encode('latin1').decode('utf8')
u'\u043c\u0442\u0441 \u043e\u0444\u0438\u0446\u0438\u0430\u043b\u044c\u043d\u044b\u0439 \u0441\u0430\u0439\u0442'
>>> print data[0].encode('latin1').decode('utf8')
мтс официальный сайт
Note that you have a list with one unicode object in it. You may want to study up on Python and Unicode; I recommend the following documents:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
These will help you understand when to encode and when to decode, and what codec to use.
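As a quick illustration of the direction of each operation, here is a minimal Python 2 sketch (the byte string is the UTF-8 encoding of the first word in your data):
# decode: bytes -> unicode; encode: unicode -> bytes
byte_string = '\xd0\xbc\xd1\x82\xd1\x81'  # UTF-8 bytes for u'мтс'
text = byte_string.decode('utf-8')        # u'\u043c\u0442\u0441'
round_trip = text.encode('utf-8')         # back to the original bytes
assert round_trip == byte_string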

Related

len(unicode string)

>>> c='中文'
>>> c
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(c)
6
>>> cu=u'中文'
>>> cu
u'\u4e2d\u6587'
>>> len(cu)
2
>>> s='𤭢'
>>> s
'\xf0\xa4\xad\xa2'
>>> len(s)
4
>>> su=u'𤭢'
>>> su
u'\U00024b62'
>>> len(su)
2
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'UTF-8'
First, I want to make some concepts clear myself.
I've learned that a unicode string like cu = u'中文' is actually encoded in UTF-16 by the Python shell by default. Right? So when we see '\u...', is that actually UTF-16 encoding? And is '\u4e2d\u6587' a unicode string or a byte string? cu has to be stored in memory somehow, so is
0100 1110 0010 1101 0110 0101 1000 0111
(\u4e2d\u6587 converted to binary) the form in which cu is preserved, if it is a byte string? Am I right?
But it can't be a byte string. Otherwise len(cu) couldn't be 2; it would be 4!!
So it has to be a unicode string. BUT!!! I've also learned that
python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding; in this instance it's "UTF-8".
>>> cu.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
So! How could len(cu) == 2??? Is that because there are two '\u' escapes in it?
But then len(su) == 2 doesn't make sense!
Am I missing something?
I'm using python 2.7.12
The Python unicode type holds Unicode codepoints, and is not meant to be an encoding. How Python does this internally is an implementation detail, and not something you need to be concerned with most of the time. The '\u' escapes are not UTF-16 code units, because UTF-16 is just another codec you can use to encode Unicode text, just like UTF-8 is. (That said, the implementation does leak through in one place: on a "narrow" Python 2 build, codepoints outside the Basic Multilingual Plane, such as u'\U00024b62', are stored as a UTF-16 surrogate pair, which is why len(su) is 2 rather than 1.)
The most important thing here is that a standard Python str object holds bytes, which may or may not hold text encoded to a certain codec (your sample uses UTF-8 but that's not a given), and unicode holds Unicode codepoints. In an interactive interpreter session, it is the codec of your terminal that determines what bytes are received by Python (which then uses sys.stdin.encoding to decode these as needed when you create a u'...' unicode object).
Only when writing to sys.stdout (say, when using print) does the sys.stdout.encoding value come into play; Python will then automatically encode your Unicode strings. Only then will your 2 Unicode codepoints be encoded to UTF-8 again and written to your terminal, which knows how to interpret those bytes.
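A small sketch of the distinction (Python 2): the number of codepoints is fixed, but the number of bytes depends entirely on the codec you encode with.
cu = u'\u4e2d\u6587'
print len(cu)                      # 2 codepoints
print len(cu.encode('utf-8'))      # 6 bytes
print len(cu.encode('utf-16-be'))  # 4 bytes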
You probably want to read up on Python and Unicode; I recommend:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO

Python UTF-8 Latin-1 displays wrong character

I'm writing a very small script that can convert latin-1 characters into unicode (I'm a complete beginner in Python).
I tried a method like this:
def latin1_to_unicode(character):
    uni = character.decode('latin-1').encode('utf-8')
    return uni
It works fine for characters that are not specific to the latin-1 set, but if I try the following example:
print latin1_to_unicode('å')
It returns Ã¥ instead of å. Same goes for other letters like æ and ø.
Can anyone please explain why this is happening?
Thanks
I have the # -*- coding: utf8 -*- declaration in my script, if that matters to the problem.
Your source code is encoded as UTF-8, but you are decoding the data as Latin-1. Don't do that; you are creating a Mojibake.
Decode from UTF-8 instead, and don't encode again. print will write to sys.stdout which will have been configured with your terminal or console codec (detected when Python starts).
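A minimal corrected version of the function (a sketch; the name is hypothetical, and it assumes the argument is a UTF-8 byte string, as in your source file):
def utf8_to_unicode(byte_string):
    # The source is saved as UTF-8, so decode the bytes as UTF-8.
    # Return the unicode object; print will encode it for the terminal.
    return byte_string.decode('utf-8')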
My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:
>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥
You can see that the character uses two bytes; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.
Decoding those two bytes as Latin-1 produces two Unicode codepoints corresponding to the Latin-1 codec.
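You can verify which two codepoints you get (a quick sketch using the unicodedata module, assuming the same UTF-8 terminal as above):
>>> import unicodedata
>>> for ch in 'å'.decode('latin1'):
...     print repr(ch), unicodedata.name(ch)
...
u'\xc3' LATIN CAPITAL LETTER A WITH TILDE
u'\xa5' YEN SIGN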
You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO

Django Latin Characters do not work

I'm trying to solve this issue when importing a CSV file. I try to save a string variable that contains Latin-1 characters, and when I try to print it, it comes out as escaped bytes instead. Is there anything I can do to keep the characters? I simply want to keep the characters as they are, nothing else.
Here's the issue (as seen from Django's manage shell):
>>> variable = "{'job_title': 'préventeur'}"
>>> variable
"{'job_title': 'pr\xc3\xa9venteur'}"
Why does Django or Python automatically change the string? Do I have to change the character set or something?
Anything will help. Thanks!
Your terminal is entering encoded characters; you are using UTF-8, and thus Python receives two bytes when you type é.
Decode from UTF-8 in that case:
>>> print 'pr\xc3\xa9venteur'.decode('utf8')
préventeur
You really want to read up on Python and Unicode though:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
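Since the data comes from a CSV import, here is a minimal Python 2 sketch of decoding fields as you read them (the file name is hypothetical, and it assumes the CSV is UTF-8-encoded; Python 2's csv module yields byte strings):
import csv
with open('jobs.csv', 'rb') as f:
    for row in csv.reader(f):
        # csv gives us bytes; decode each field to unicode explicitly
        row = [field.decode('utf-8') for field in row]
        print row[0]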
"{'job_title': 'pr\xc3\xa9venteur'}"
The characters have been encoded into UTF-8 for you, which is pretty nice, because you don't want to stick with Latin-1 if you value your sanity. Convert to Unicode for best results:
>>> '\xc3\xa9'.decode('UTF-8')
u'é'
Have you tried using the print statement instead? (Note: this session came from a console that doesn't use UTF-8; in the Windows console codepages cp437 and cp850, é is the single byte \x82, which is why the repr below differs from the UTF-8 bytes above.)
>>> variable = "{'job_title': 'préventeur'}"
>>> variable
"{'job_title': 'pr\x82venteur'}"
>>> repr(variable)
'"{\'job_title\': \'pr\\x82venteur\'}"'
>>> print variable
{'job_title': 'préventeur'}

How do I convert a unicode to a string at the Python level?

The following unicode and string can exist on their own if defined explicitly:
>>> value_str='Andr\xc3\xa9'
>>> value_uni=u'Andr\xc3\xa9'
If I only have u'Andr\xc3\xa9' assigned to a variable like above, how do I convert it to 'Andr\xc3\xa9' in Python 2.5 or 2.6?
EDIT:
I did the following:
>>> value_uni.encode('latin-1')
'Andr\xc3\xa9'
which fixes my issue. Can someone explain to me what exactly is happening?
You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9' which is equivalent to 'André'.
But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:
>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'
Then decode it correctly:
>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'
Now it is in the correct format.
However instead of doing this, if possible you should try to work out why the data has been incorrectly encoded in the first place, and fix that problem there.
You asked (in a comment) """That is what's puzzling me. How did it go from it original accented to what it is now? When you say double encoding with utf8 and latin1, is that a total of 3 encodings(2 utf8 + 1 latin1)? What's the order of the encode from the original state to the current one?"""
In the answer by Mark Byers, he says """what you have seems to be a UTF-8 encoding that has been incorrectly decoded""". You have accepted his answer. But you are still puzzled? OK, here's the blow-by-blow description:
Note: All strings will be displayed using (implicitly) repr(). unicodedata.name() will be used to verify the contents. That way, variations in console encoding cannot confuse interpretation of the strings.
Initial state: you have a unicode object that you have named u1. It contains e-acute:
>>> u1 = u'\xe9'
>>> import unicodedata as ucd
>>> ucd.name(u1)
'LATIN SMALL LETTER E WITH ACUTE'
You encode u1 as UTF-8 and name the result s:
>>> s = u1.encode('utf8')
>>> s
'\xc3\xa9'
You decode s using latin1 -- INCORRECTLY; s was encoded using utf8, NOT latin1. The result is meaningless rubbish.
>>> u2 = s.decode('latin1')
>>> u2
u'\xc3\xa9'
>>> ucd.name(u2[0]); ucd.name(u2[1])
'LATIN CAPITAL LETTER A WITH TILDE'
'COPYRIGHT SIGN'
>>>
Please understand: unicode_object.encode('x').decode('y') when x != y is normally [see note below] a nonsense; it will raise an exception if you are lucky; if you are unlucky it will silently create gibberish. Also please understand that silently creating gibberish is not a bug -- there is no general way that Python (or any other language) can detect that a nonsense has been committed. This applies particularly when latin1 is involved, because all 256 of its codepoints map 1 to 1 with the first 256 Unicode codepoints, so it is impossible to get a UnicodeDecodeError from str_object.decode('latin1').
Of course, abnormally (one hopes that it's abnormal) you may need to reverse out such a nonsense by doing gibberish_unicode_object.encode('y').decode('x') as suggested in various answers to your question.
If you have u'Andr\xc3\xa9', that is a Unicode string that was decoded from a byte string with the wrong encoding. The correct encoding is UTF-8. To convert it back to a byte string so you can decode it correctly, you can use the trick you discovered. The first 256 code points of Unicode are a 1:1 mapping with ISO-8859-1 (alias latin1) encoding. So:
>>> u'Andr\xc3\xa9'.encode('latin1')
'Andr\xc3\xa9'
Now it is a byte string that can be decoded correctly with utf8:
>>> 'Andr\xc3\xa9'.decode('utf8')
u'Andr\xe9'
>>> print 'Andr\xc3\xa9'.decode('utf8')
André
In one step:
>>> print u'Andr\xc3\xa9'.encode('latin1').decode('utf8')
André
value_uni.encode('utf8') or whatever encoding you need.
See http://docs.python.org/library/stdtypes.html#str.encode
The OP is not converting to ASCII or UTF-8; that's why the suggested encode methods won't work. Try this:
v = u'Andr\xc3\xa9'
s = ''.join(map(lambda x: chr(ord(x)),v))
The chr(ord(x)) business gets the numeric value of the unicode character (which had better fit in one byte for your application), and the ''.join call joins the resulting one-character strings back into an ordinary byte string. No doubt there is a more elegant way.
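For what it's worth, this byte-for-byte copy is equivalent to encoding with Latin-1, since the first 256 Unicode codepoints map 1:1 onto byte values (a quick sketch):
v = u'Andr\xc3\xa9'
assert ''.join(chr(ord(x)) for x in v) == v.encode('latin-1')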
Simplified explanation. The str type can hold only characters in the 0-255 range. If you want to store unicode (which can contain characters from a much wider range) in a str, you first have to encode the unicode to a format suitable for str, for example UTF-8.
To do this, call the encode method on your unicode object and give it the desired encoding as an argument, for example this_is_str = value_uni.encode('utf-8').
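A tiny sketch of the point (the example value is hypothetical):
value_uni = u'Andr\xe9'                  # unicode: one codepoint for é
this_is_str = value_uni.encode('utf-8')  # str: é becomes the two bytes \xc3\xa9
print repr(this_is_str)                  # prints 'Andr\xc3\xa9'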
You can read longer and more in-depth (and language agnostic) article on Unicode handling here: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Another excellent article (this time Python-specific): Unicode HOWTO
It seems like
str(value_uni)
should work... at least, it did when I tried it.
EDIT: Turns out that this only works because my system's default encoding is, as far as I can tell, ISO-8859-1 (Latin-1). So for a platform-independent version of this, try
value_uni.encode('latin1')

How do I convert a file's format from Unicode to ASCII using Python?

I use a 3rd party tool that outputs a file in Unicode format. However, I prefer it to be in ASCII. The tool does not have settings to change the file format.
What is the best way to convert the entire file format using Python?
You can convert the file easily enough just using the unicode function, but you'll run into problems with Unicode characters without a straight ASCII equivalent.
This blog recommends the unicodedata module, which seems to take care of roughly converting characters without direct corresponding ASCII values, e.g.
>>> title = u"Klüft skräms inför på fédéral électoral große"
is typically converted to
Klft skrms infr p fdral lectoral groe
which is pretty wrong. However, using the unicodedata module, the result can be much closer to the original text:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
I think this is a deeper issue than you realize. Simply changing the file from Unicode into ASCII is easy; however, getting all of the Unicode characters to translate into reasonable ASCII counterparts (many letters are not available in both encodings) is another matter entirely.
This Python Unicode tutorial may give you a better idea of what happens to Unicode strings that are translated to ASCII: http://www.reportlab.com/i18n/python_unicode_tutorial.html
Here's a useful quote from the site:
Python 1.6 also gets a "unicode" built-in function, to which you can specify the encoding:
>>> unicode('hello')
u'hello'
>>> unicode('hello', 'ascii')
u'hello'
>>> unicode('hello', 'iso-8859-1')
u'hello'
All three of these return the same thing, since the characters in 'Hello' are common to all three encodings.
Now let's encode something with a European accent, which is outside of ASCII. What you see at a console may depend on your operating system locale; Windows lets me type in ISO-Latin-1.
>>> a = unicode('André', 'latin-1')
>>> a
u'Andr\202'
If you can't type an acute letter e, you can enter the string 'Andr\202', which is unambiguous.
Unicode supports all the common operations such as iteration and splitting. We won't run over them here.
By the way, there is a Linux command, iconv, to do this kind of job.
iconv -f utf8 -t ascii <input.txt >output.txt
Here's some simple (and stupid) code to do encoding translation. I'm assuming (but you shouldn't) that the input file is in UTF-16 (Windows calls this simply 'Unicode').
input_codec = 'UTF-16'
output_codec = 'ASCII'
unicode_file = open('filename')
unicode_data = unicode_file.read().decode(input_codec)
ascii_file = open('new filename', 'w')
ascii_file.write(unicode_data.encode(output_codec))
Note that this will not work if there are any characters in the Unicode file that are not also ASCII characters. You can do the following to turn unrecognized characters into '?'s:
ascii_file.write(unicode_data.encode(output_codec, 'replace'))
Check out the docs for more simple choices. If you need to do anything more sophisticated, you may wish to check out The UNICODE Hammer at the Python Cookbook.
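As a quick sketch of the difference between two of those error handlers (Python 2):
>>> print u'Kl\xfcft'.encode('ascii', 'replace')
Kl?ft
>>> print u'Kl\xfcft'.encode('ascii', 'ignore')
Klft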
Like this:
uc = open(filename).read().decode('utf8')
ascii = uc.encode('ascii')
Note, however, that this will fail with a UnicodeEncodeError exception if there are any characters that can't be converted to ASCII.
EDIT: As Pete Karl just pointed out, there is no one-to-one mapping from Unicode to ASCII, so some characters simply can't be converted in an information-preserving way. Moreover, standard ASCII is a subset of UTF-8, so a file containing only ASCII characters is already valid UTF-8 and needs no conversion at all.
For my problem, where I just wanted to skip the non-ASCII characters and output only ASCII, the solution below worked really well:
import unicodedata
input = open(filename).read().decode('UTF-16')
output = unicodedata.normalize('NFKD', input).encode('ASCII', 'ignore')
It's important to note that there is no 'Unicode' file format. Unicode can be encoded to bytes in several different ways. Most commonly UTF-8 or UTF-16. You'll need to know which one your 3rd-party tool is outputting. Once you know that, converting between different encodings is pretty easy:
in_file = open("myfile.txt", "rb")
out_file = open("mynewfile.txt", "wb")
in_byte_string = in_file.read()
unicode_string = in_byte_string.decode('UTF-16')
out_byte_string = unicode_string.encode('ASCII')
out_file.write(out_byte_string)
out_file.close()
As noted in the other replies, you're probably going to want to supply an error handler to the encode method. Using 'replace' as the error handler is simple, but will mangle your text if it contains characters that cannot be represented in ASCII.
As other posters have noted, ASCII is a subset of Unicode.
However if you:
have a legacy app
you don't control the code for that app
you're sure your input falls into the ASCII subset
Then the example below shows how to do it:
>>> mystring = u'bar'
>>> type(mystring)
<type 'unicode'>
>>> myasciistring = mystring.encode('ASCII')
>>> type(myasciistring)
<type 'str'>
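If the "your input falls into the ASCII subset" assumption might not hold, here is a sketch of a variant that fails loudly instead of silently (the fallback policy shown is just one option):
try:
    myasciistring = mystring.encode('ASCII')
except UnicodeEncodeError:
    # Input contained non-ASCII codepoints; pick a policy, e.g. replace with '?'
    myasciistring = mystring.encode('ASCII', 'replace')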
