Converting double-backslash UTF-8 encoding - Python

I cannot get this to work! I have a text file from a save-game file parser with a bunch of UTF-8 Chinese names in byte form; source.txt shows them like this:
\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89
But, no matter how I import it into Python (3 or 2), I get this string, at best:
\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89
I have tried, like other threads have suggested, to re-encode the string as UTF-8 and then decode it with unicode escape, like so:
stringName.encode("utf-8").decode("unicode_escape")
But then it messes up the original encoding, and gives this as the string:
'æ\x89\x8eå\x8a\xa0æ\x8b\x89' (printing this string results in: æå æ )
Now, if I manually copy and paste the original string with a b prefix and decode it, I get the correct text. For example:
b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'.decode("utf-8")
Results in: '扎加拉'
But I can't do this programmatically. I can't even get rid of the doubled backslashes.
To be clear, source.txt contains single backslashes. I have tried importing it in many ways, but this is the most common:
with open('source.txt','r',encoding='utf-8') as f_open:
    source = f_open.read()
Okay, so I accepted the answer below (I think), but here is what works:
from ast import literal_eval
decodedString = literal_eval("b'{}'".format(stringVariable)).decode('utf-8')
I can't use it on the whole file because of other encoding issues, but extracting each name as a string (stringVariable) and then doing that works! Thank you!
To be more clear, the original file is not just these messed up utf encodings. It only uses them for certain fields. For example, here is the beginning of the file:
{'m_cacheHandles': ['s2ma\x00\x00CN\x1f\x1b"\x8d\xdb\x1fr \\\xbf\xd4D\x05R\x87\x10\x0b\x0f9\x95\x9b\xe8\x16T\x81b\xe4\x08\x1e\xa8U\x11',
's2ma\x00\x00CN\x1a\xd9L\x12n\xb9\x8aL\x1d\xe7\xb8\xe6\xf8\xaa\xa1S\xdb\xa5+\t\xd3\x82^\x0c\x89\xdb\xc5\x82\x8d\xb7\x0fv',
's2ma\x00\x00CN\x92\xd8\x17D\xc1D\x1b\xf6(\xedj\xb7\xe9\xd1\x94\x85\xc8`\x91M\x8btZ\x91\xf65\x1f\xf9\xdc\xd4\xe6\xbb',
's2ma\x00\x00CN\xa1\xe9\xab\xcd?\xd2PS\xc9\x03\xab\x13R\xa6\x85u7(K2\x9d\x08\xb8k+\xe2\xdeI\xc3\xab\x7fC',
's2ma\x00\x00CNN\xa5\xe7\xaf\xa0\x84\xe5\xbc\xe9HX\xb93S*sj\xe3\xf8\xe7\x84`\xf1Ye\x15~\xb93\x1f\xc90',
's2ma\x00\x00CN8\xc6\x13F\x19\x1f\x97AH\xfa\x81m\xac\xc9\xa6\xa8\x90s\xfdd\x06\rL]z\xbb\x15\xdcI\x93\xd3V'],
'm_campaignIndex': 0,
'm_defaultDifficulty': 7,
'm_description': '',
'm_difficulty': '',
'm_gameSpeed': 4,
'm_imageFilePath': '',
'm_isBlizzardMap': True,
'm_mapFileName': '',
'm_miniSave': False,
'm_modPaths': None,
'm_playerList': [{'m_color': {'m_a': 255, 'm_b': 255, 'm_g': 92, 'm_r': 36},
'm_control': 2,
'm_handicap': 0,
'm_hero': '\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89',
All of the information before the 'm_hero': field is not utf-8. So using ShadowRanger's solution works if the file is only made up of these fake utf-encodings, but it doesn't work when I have already parsed m_hero as a string and try to convert that. Karin's solution does work for that.

The problem is that the unicode_escape codec is implicitly decoding the result of the escape fixes by assuming the bytes are latin-1, not utf-8. You can fix this by:
# Read the file as bytes:
with open(myfile, 'rb') as f:
    data = f.read()

# Decode with unicode-escape to get a Py2 unicode/Py3 str, but interpreted
# incorrectly as latin-1:
badlatin = data.decode('unicode-escape')

# Encode back as latin-1 to get back the raw bytes (it's a 1:1 encoding),
# then decode them properly as utf-8:
goodutf8 = badlatin.encode('latin-1').decode('utf-8')
This (assuming the file contains the literal backslashes and codes, not the bytes they represent) leaves you with '\u624e\u52a0\u62c9' (which should be correct; I'm just on a system without font support for those characters, so that's just the safe repr based on Unicode escapes). You could skip a step in Py2 by using the string-escape codec for the first-stage decode (which I believe would let you omit the .encode('latin-1') step), but this solution should be portable, and the cost shouldn't be terrible.
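For a quick check of the round trip, here is a minimal sketch (assuming the escapes appear in the text exactly as shown in the question):
raw = rb'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'   # the literal backslash escapes
fixed = raw.decode('unicode-escape').encode('latin-1').decode('utf-8')
print(fixed)  # 扎加拉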

I'm assuming you're using Python 3. In Python 2, strings are bytes by default, so it would just work for you. But in Python 3, strings are unicode and interpreted as unicode, which is what makes this problem harder if you have a byte string being read as unicode.
This solution was inspired by mgilson's answer. We can literally evaluate your unicode string as a byte string by using literal_eval:
from ast import literal_eval
with open('source.txt', 'r', encoding='utf-8') as f_open:
    source = f_open.read()
string = literal_eval("b'{}'".format(source)).decode('utf-8')
print(string) # 扎加拉

You can do some silly things like evaluating the string:
import ast
s = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'
print ast.literal_eval('"%s"' % s).decode('utf-8')
Note: use ast.literal_eval rather than eval if you don't want attackers to gain access to your system :-P
Using this in your case would probably look something like:
with open('file') as file_handle:
    data = ast.literal_eval('"%s"' % file_handle.read()).decode('utf-8')
I think that the real issue here is likely that you have a file that contains strings representing bytes (rather than having a file that just stores the bytes themselves). So, fixing whatever code generated that file in the first place is probably a better bet. However, barring that, this is the next best thing that I could come up with ...
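For illustration, here is a hedged sketch of what fixing the producer could look like (the file names and the name_bytes value are hypothetical):
# Hypothetical sketch: write real UTF-8 text instead of the repr() of bytes.
name_bytes = b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'

# Problematic: str() of a bytes object writes its repr(), escapes and all.
with open('source_bad.txt', 'w') as f:
    f.write(str(name_bytes))

# Better: decode once and write genuine text.
with open('source_good.txt', 'w', encoding='utf-8') as f:
    f.write(name_bytes.decode('utf-8'))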

A solution in Python 3 with only string manipulation and encoding conversions, without evil eval :)
import binascii
s = '\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'
s = s.replace('\\x', '')   # s == 'e6898ee58aa0e68b89'
# We can use any encoding that passes ASCII through unchanged;
# for example, s.encode('ascii') would also work here.
s = s.encode('utf8')       # s == b'e6898ee58aa0e68b89'
s = binascii.a2b_hex(s)    # s == b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'
s = s.decode('utf8')       # s == '扎加拉'
If you like a one-liner, we can put it simply as:
binascii.a2b_hex(s.replace('\\x', '').encode()).decode('utf8')
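For what it's worth, bytes.fromhex (Python 3) can stand in for the binascii call; a minimal equivalent sketch:
s = '\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'
print(bytes.fromhex(s.replace('\\x', '')).decode('utf-8'))  # 扎加拉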

At the end of the day, what you get back is a string, right? I would use the string .replace() method to convert the double backslashes to single backslashes and add a b prefix to make it work.
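There is no way to attach a b prefix to a runtime value, though, so in practice this idea still ends in an explicit encode/decode; a rough sketch of what it amounts to (the doubled-backslash input here is hypothetical):
s = r'\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'   # doubled backslashes
single = s.replace('\\\\', '\\')                       # now single backslashes
as_bytes = single.encode('latin-1').decode('unicode_escape').encode('latin-1')
print(as_bytes.decode('utf-8'))                        # 扎加拉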

So there are several different ways to interpret having the data "in byte form." Let's assume you really do:
s = b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'
The b prefix indicates those are bytes. Without getting into the whole mess that is bytes vs. codepoints/characters and the long list of differences between Python 2 and 3, the b-prefixed string indicates those are intended to be bytes (e.g. raw UTF-8 bytes).
Then just decode it, which converts the UTF-8 encoding (which you already have in the bytes) into true Unicode characters. In Python 2.7, e.g.:
print s.decode('utf-8')
yields:
扎加拉
One of your examples did an encode followed by a decode, which can only lead to sorrow and pain. If your variable holds true UTF-8 bytes, you only need the decode.
Update Based on discussion, it appears the data isn't really in UTF-8 bytes, but a string-serialized version of same. There are a lot of ways to get from string serial to bytes. Here's mine:
from struct import pack
def byteize(s):
    """
    Given a backslash-escaped string serialization of bytes,
    decode it into a genuine byte string.
    """
    bvals = [int(s[i:i+2], 16) for i in range(2, len(s), 4)]
    return pack(str(len(bvals)) + 'B', *bvals)
Then:
print byteize(s).decode('utf-8')
as before yields:
扎加拉
This byteize() isn't as general as the literal_eval()-based accepted answer, but %timeit benchmarking shows it to be about 33% faster on short strings. It could be further accelerated by swapping out range for xrange under Python 2. The literal_eval approach wins handily on long strings, however, given its lower-level nature.
100000 loops, best of 3: 6.19 µs per loop
100000 loops, best of 3: 8.3 µs per loop

TypeError: a bytes-like object is required, not 'str' when writing to a file in Python 3

I've very recently migrated to Python 3.5.
This code was working properly in Python 2.7:
with open(fname, 'rb') as f:
    lines = [x.strip() for x in f.readlines()]

for line in lines:
    tmp = line.strip().lower()
    if 'some-pattern' in tmp: continue
    # ... code
But in 3.5, on the if 'some-pattern' in tmp: continue line, I get an error which says:
TypeError: a bytes-like object is required, not 'str'
I was unable to fix the problem using .decode() on either side of the in, nor could I fix it using
if tmp.find('some-pattern') != -1: continue
What is wrong, and how do I fix it?
You opened the file in binary mode:
with open(fname, 'rb') as f:
This means that all data read from the file is returned as bytes objects, not str. You cannot then use a string in a containment test:
if 'some-pattern' in tmp: continue
You'd have to use a bytes object to test against tmp instead:
if b'some-pattern' in tmp: continue
or open the file as a textfile instead by replacing the 'rb' mode with 'r'.
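A minimal sketch of the text-mode variant (assuming the file is UTF-8; adjust the encoding to match your data):
with open(fname, 'r', encoding='utf-8') as f:
    lines = [line.strip().lower() for line in f]

for line in lines:
    if 'some-pattern' in line:
        continue
    # ... code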
You can encode your string by using .encode()
Example:
'Hello World'.encode()
As the error describes, in order to write a string to a binary-mode file you need to encode it to a bytes-like object first, and encode() encodes it to a byte string.
As has already been mentioned, you are reading the file in binary mode and then creating a list of bytes. In the following for loop you are comparing a string to bytes, and that is where the code fails.
Decoding the bytes while adding to the list should work. The changed code should look as follows:
with open(fname, 'rb') as f:
    lines = [x.decode('utf8').strip() for x in f.readlines()]
The bytes type is a distinct type in Python 3, and that is why your code worked in Python 2: there, bytes was just an alias for str rather than a separate data type:
>>> s=bytes('hello')
>>> type(s)
<type 'str'>
You have to change from wb to w:
def __init__(self):
    self.myCsv = csv.writer(open('Item.csv', 'wb'))
    self.myCsv.writerow(['title', 'link'])
to
def __init__(self):
    self.myCsv = csv.writer(open('Item.csv', 'w'))
    self.myCsv.writerow(['title', 'link'])
After changing this, the error disappears, but you can't write to the file (in my case). So after all, I don't have an answer?
Source: How to remove ^M
Changing to 'rb' gives me another error: io.UnsupportedOperation: write
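In Python 3 the csv module expects text-mode files opened with newline=''; a minimal sketch of that, which should also avoid the stray \r (^M) characters mentioned above:
import csv

with open('Item.csv', 'w', newline='') as f:
    myCsv = csv.writer(f)
    myCsv.writerow(['title', 'link'])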
Use the encode() method on the hard-coded string literal.
Example:
file.write(answers[i] + '\n'.encode())
Or
line.split(' +++$+++ '.encode())
For this small example, adding only the b prefix before
'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n' solved my problem:
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send(b'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data)
mysock.close()
What does the 'b' character do in front of a string literal?
You opened the file in binary mode:
The following code will throw
a TypeError: a bytes-like object is required, not 'str'.
for line in lines:
    print(type(line))  # <class 'bytes'>
    if 'substring' in line:
        print('success')
The following code will work - you have to use the decode() function:
for line in lines:
    line = line.decode()
    print(type(line))  # <class 'str'>
    if 'substring' in line:
        print('success')
Try opening your file as text:
with open(fname, 'rt') as f:
    lines = [x.strip() for x in f.readlines()]
Additionally, here is a link for Python 3.x on the official page:
io — Core tools for working with streams.
And this is the open function: open
If you really are trying to handle it as binary, then consider encoding your string.
I got this error when I was trying to convert a char (or string) to bytes; the code was something like this in Python 2.7:
# -*- coding: utf-8 -*-
print(bytes('ò'))
This is how Python 2.7 deals with Unicode characters.
This won't work in Python 3.6, since bytes requires an extra encoding argument, and that can be a little tricky, since different encodings may produce different results:
print(bytes('ò', 'iso_8859_1')) # prints: b'\xf2'
print(bytes('ò', 'utf-8')) # prints: b'\xc3\xb2'
In my case I had to use iso_8859_1 when encoding bytes in order to solve the issue.
Summary
Python 2.x encouraged many bad habits with respect to text handling. In particular, its type named str does not actually represent text per the Unicode standard (that type is unicode), and the default "string literal" in fact produces a sequence of raw bytes - with some convenience functions for treating it like a string, if you can get away with assuming a "code page" style encoding.
In 3.x, "string literals" now produce actual strings, and built-in functionality no longer does any implicit conversions between the two types. Thus, the same code now has a TypeError, because the literal and the variable are of incompatible types. To fix the problem, one of the values must be either replaced or converted, so that the types match.
The Python documentation has an extremely detailed guide to working with Unicode properly.
In the example in the question, the input file is processed as if it contains text. Therefore, the file should have been opened in a text mode in the first place. The only good reason the file would have been opened in binary mode even in 2.x is to avoid universal newline translation; in 3.x, this is done by specifying the newline keyword parameter when opening a file in text mode.
To read a file as text properly requires knowing a text encoding, which is specified in the code by (string) name. The encoding iso-8859-1 is a safe fallback; it interprets each byte separately, as representing one of the first 256 Unicode code points, in order (so it will never raise an exception due to invalid data). utf-8 is much more common as of the time of writing, but it does not accept arbitrary data. (However, in many cases, for English text, the distinction will not matter; both of those encodings, and many more, are supersets of ASCII.)
Thus:
with open(fname, 'r', newline='\n', encoding='iso-8859-1') as f:
    lines = [x.strip() for x in f.readlines()]

# Proceed as before.
# If the results are wrong, take additional steps to ascertain the correct encoding.
How the error is created when migrating from 2.x to 3.x
In 2.x, 'some-pattern' creates a str, i.e. a sequence of bytes that the programmer is then likely to pretend is text. The str type is the same as the bytes type, and different from the unicode type that properly represents text. Many methods are offered to treat this data as if it were text, but it is not a proper representation of text. The meaning of each value as a text character (the encoding) is assumed. (In order to enable the illusion of raw data as "text", there would sometimes be implicit conversions between the str and unicode types. However, this results in confusing errors of its own - such as getting UnicodeDecodeError from an attempt to encode, or vice-versa).
In 3.x, 'some-pattern' creates what is also called a str; but now str means the Unicode-using, properly-text-representing string type. (unicode is no longer used as a type name, and only bytes refers to the sequence-of-bytes type.) Some changes were made to bytes to dissociate it from the text-with-assumed-encoding interpretation (in particular, indexing into a bytes object now results in an int, rather than a 1-element bytes), but many strange legacy methods persist (including ones rarely used even with actual strings any more, like zfill).
Why this causes a problem
The data, tmp, is a bytes instance. It came from a binary source: in this case, a file opened with a 'b' file mode. In other cases, it could come from a raw network socket, a web request made with urllib or similar, or some other API call.
This means that it cannot do anything meaningful in combination with a string. The elements of a string are Unicode code points (i.e., abstractions that represent, for the most part, text characters, in a universal form that represents all world languages and many other symbols). The elements of a bytes are, well, bytes. (Specifically in 3.x, they are interpreted as unsigned integers ranging from 0 to 255 inclusive.)
When the code was migrated, the literal 'some-pattern' went from describing a bytes, to describing text. Thus, the code went from making a legal comparison (byte-sequence to byte-sequence), to making an illegal one (string to byte-sequence).
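A quick illustration of the mismatch (Python 3):
s = 'abc'
b = b'abc'
print(s[0])  # 'a' -- the elements of a str are length-1 strings
print(b[0])  # 97  -- the elements of a bytes are ints
# 'a' in b   ->  TypeError: a bytes-like object is required, not 'str'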
Fixing the problem
In order to operate on a string and a byte-sequence - whether it's checking for equality with ==, lexicographic comparison with <, substring search with in, concatenation with +, or anything else - either the string must be converted to a byte-sequence, or vice-versa. In general, only one of these will be the correct, sensible answer, and it will depend on the context.
Fixing the source
Sometimes, one of the values can be seen to be "wrong" in the first place. For example, if reading the file was intended to result in text, then it should have been opened in a text mode. In 3.x, the file encoding can simply be passed as an encoding keyword argument to open, and conversion to Unicode is handled seamlessly without having to feed a binary file to an explicit translation step (thus, universal newline handling still takes place seamlessly).
In the case of the original example, that could look like:
with open(fname, 'r') as f:
    lines = [x.strip() for x in f.readlines()]
This example assumes a platform-dependent default encoding for the file. This will normally work for files that were created in straightforward ways, on the same computer. In the general case, however, the encoding of the data must be known in order to work with it properly.
If the encoding is known to be, for example, UTF-8, that is trivially specified:
with open(fname, 'r', encoding='utf-8') as f:
    lines = [x.strip() for x in f.readlines()]
Similarly, a string literal that should have been a bytes literal is simply missing a prefix: to make the bytes sequence representing integer values [101, 120, 97, 109, 112, 108, 101] (i.e., the ASCII values of the letters example), write the bytes literal b'example' rather than the string literal 'example'. Similarly the other way around.
In the case of the original example, that would look like:
if b'some-pattern' in tmp:
There is a safeguard built in to this: the bytes literal syntax only allows ASCII characters, so something like b'ëxãmþlê' will be caught as a SyntaxError, regardless of the encoding of the source file (since it is not clear which byte values are meant; in the old implied-encoding schemes, the ASCII range was well established, but everything else was up in the air.) Of course, bytes literals with elements representing values 128..255 can still be written by using \x escaping for those values: for example, b'\xebx\xe3m\xfel\xea' will produce a byte-sequence corresponding to the text ëxãmþlê in Latin-1 (ISO 8859-1) encoding.
Converting, when appropriate
Conversion between byte-sequences and text is only possible when an encoding has been determined. It has always been so; we just used to assume an encoding locally, and then mostly ignore that we had done so. (Programmers in places like East Asia have been more aware of the problem historically, because they commonly need to work with scripts that have more than 256 distinct symbols, and thus their text requires multi-byte encodings.)
In 3.x, because there is no pressure to be able to treat byte-sequences implicitly as text with an assumed encoding, there are therefore no implicit conversion steps behind the scenes. This means that understanding the API is straightforward: Bytes are raw data; therefore, they are used to encode text, which is an abstraction. Therefore, the .encode() method is provided by str (which represents text), in order to encode text into raw data. Similarly, the .decode() method is provided by bytes (which represents a byte-sequence), in order to decode raw data into text.
Applying these to the example code, again supposing UTF-8 encoding is appropriate, gives:
if 'some-pattern'.encode('utf-8') in tmp:
and
if 'some-pattern' in tmp.decode('utf-8'):

Python, Encoding output to UTF-8

I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try x.write(string) I get UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do print(u'something'). But I need to use a variable, and the quotation marks in u'_' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially all this is supposed to be doing is building me a large amount of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest opening the input and/or output file with encoding "utf-8-sig", which automatically handles the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that would only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM, as Python's default UTF-8 codec interprets BOMs as regular characters, so the input would not raise an issue but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
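Putting those two points together, a minimal self-contained sketch (Python 2; the value reuses the sample string from the question):
# -*- coding: utf-8 -*-
import codecs

# codecs.open()-ed files want unicode objects, not encoded byte strings.
out = codecs.open('out.csv', 'w+', 'utf-8')
value = u'[日木曜 Deliverables]= CASE WHEN things = 11 THEN C ELSE 0 END'
out.write(value)  # fine: value is unicode
# out.write(value.encode('utf-8'))  # would trigger an implicit ascii decode and can raise
out.close()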
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes, the latter unicode code points. It is easy to determine which type a string is: a unicode string starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' characters are the same because they are in the ASCII range. \u0430 is a unicode code point; it is outside the ASCII range. A "code point" is Python's internal representation of unicode characters, and code points can't be saved to a file directly; they need to be encoded to bytes first. Here is what an encoded unicode string looks like (once encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string now can be written to file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
... f.write(s.encode('utf8'))
Now, it is important to remember what encoding we used when writing to the file, because to read the data back we need to decode the content. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
... content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got encoded bytes, exactly the same as from s.encode('utf8'). To decode, we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got back our unicode string with unicode code points.
>>> print content.decode('utf8')
abc абв
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM, or byte order mark, and it's basically a holdover from the early days of unicode, when people couldn't agree which way they wanted their bytes to go. A unicode document may be prefaced with either \ufeff or \ufffe, depending on which order its bytes are arranged in.
If you hit an error on those characters in the first position, you can be sure the issue is that you are not decoding the file with a BOM-aware codec such as utf-8-sig, and the file itself is probably fine.
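If the input is UTF-8 with a BOM, the 'utf-8-sig' codec consumes it automatically; a minimal sketch:
import codecs

with codecs.open('text.txt', 'r', 'utf-8-sig') as f:
    content = f.read()  # any leading u'\ufeff' has been stripped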

Replacing = with '\x' and then decoding in python

I fetched the subject of an email message using python modules and received string
'=D8=B3=D9=84=D8=A7=D9=85_=DA=A9=D8=AC=D8=A7=D8=A6=DB=8C?='
I know the string is encoded in 'utf-8'. Python has a decode method on strings for decoding such strings, but to use it I needed to replace each = sign with \x. After doing that interchange manually and printing the decoded result, I get the string سلام_کجائی, which is exactly what I want. The question is: how can I do the interchange automatically? The answer seems harder than simple use of string functions like replace.
Below is the code I used after the manual operation:
r='\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85_\xDA\xA9\xD8\xAC\xD8\xA7\xD8\xA6\xDB\x8C'
print r.decode('utf-8')
I would appreciate any workable idea.
Just decode it from quoted-printable to get utf8-encoded bytestring:
In [35]: s = '=D8=B3=D9=84=D8=A7=D9=85_=DA=A9=D8=AC=D8=A7=D8=A6=DB=8C?='
In [36]: s.decode('quoted-printable')
Out[36]: '\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85_\xda\xa9\xd8\xac\xd8\xa7\xd8\xa6\xdb\x8c?'
Then, if needed, from utf-8 to unicode:
In [37]: s.decode('quoted-printable').decode('utf8')
Out[37]: u'\u0633\u0644\u0627\u0645_\u06a9\u062c\u0627\u0626\u06cc?'
In [39]: print s.decode('quoted-printable')
سلام_کجائی?
This sort of encoding is known as quoted-printable. There is a Python module for performing encoding and decoding.
You're right that it's just a pure quoting of binary strings, so you need to apply UTF-8 decoding afterwards. (Assuming the string is in UTF-8, of course. But that looks correct although I don't know the language.)
import quopri
print quopri.decodestring( "'=D8=B3=D9=84=D8=A7=D9=85_=DA=A9=D8=AC=D8=A7=D8=A6=DB=8C?='" ).decode( "utf-8" )
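For completeness, the same thing in Python 3, where quopri works on bytes (a minimal sketch):
import quopri

s = b'=D8=B3=D9=84=D8=A7=D9=85_=DA=A9=D8=AC=D8=A7=D8=A6=DB=8C?='
print(quopri.decodestring(s).decode('utf-8'))  # سلام_کجائی?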

How do I convert a unicode to a string at the Python level?

The following unicode and string can exist on their own if defined explicitly:
>>> value_str='Andr\xc3\xa9'
>>> value_uni=u'Andr\xc3\xa9'
If I only have u'Andr\xc3\xa9' assigned to a variable like above, how do I convert it to 'Andr\xc3\xa9' in Python 2.5 or 2.6?
EDIT:
I did the following:
>>> value_uni.encode('latin-1')
'Andr\xc3\xa9'
which fixes my issue. Can someone explain to me what exactly is happening?
You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9' which is equivalent to 'André'.
But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:
>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'
Then decode it correctly:
>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'
Now it is in the correct format.
However instead of doing this, if possible you should try to work out why the data has been incorrectly encoded in the first place, and fix that problem there.
You asked (in a comment) """That is what's puzzling me. How did it go from it original accented to what it is now? When you say double encoding with utf8 and latin1, is that a total of 3 encodings(2 utf8 + 1 latin1)? What's the order of the encode from the original state to the current one?"""
In the answer by Mark Byers, he says """what you have seems to be a UTF-8 encoding that has been incorrectly decoded""". You have accepted his answer. But you are still puzzled? OK, here's the blow-by-blow description:
Note: All strings will be displayed using (implicitly) repr(). unicodedata.name() will be used to verify the contents. That way, variations in console encoding cannot confuse interpretation of the strings.
Initial state: you have a unicode object that you have named u1. It contains e-acute:
>>> u1 = u'\xe9'
>>> import unicodedata as ucd
>>> ucd.name(u1)
'LATIN SMALL LETTER E WITH ACUTE'
You encode u1 as UTF-8 and name the result s:
>>> s = u1.encode('utf8')
>>> s
'\xc3\xa9'
You decode s using latin1 -- INCORRECTLY; s was encoded using utf8, NOT latin1. The result is meaningless rubbish.
>>> u2 = s.decode('latin1')
>>> u2
u'\xc3\xa9'
>>> ucd.name(u2[0]); ucd.name(u2[1])
'LATIN CAPITAL LETTER A WITH TILDE'
'COPYRIGHT SIGN'
>>>
Please understand: unicode_object.encode('x').decode('y') when x != y is normally [see note below] a nonsense; it will raise an exception if you are lucky; if you are unlucky it will silently create gibberish. Also please understand that silently creating gibberish is not a bug -- there is no general way that Python (or any other language) can detect that a nonsense has been committed. This applies particularly when latin1 is involved, because all 256 of its codepoints map 1-to-1 with the first 256 Unicode codepoints, so it is impossible to get a UnicodeDecodeError from str_object.decode('latin1').
Of course, abnormally (one hopes that it's abnormal) you may need to reverse out such a nonsense by doing gibberish_unicode_object.encode('y').decode('x') as suggested in various answers to your question.
If you have u'Andr\xc3\xa9', that is a Unicode string that was decoded from a byte string with the wrong encoding. The correct encoding is UTF-8. To convert it back to a byte string so you can decode it correctly, you can use the trick you discovered. The first 256 code points of Unicode are a 1:1 mapping with ISO-8859-1 (alias latin1) encoding. So:
>>> u'Andr\xc3\xa9'.encode('latin1')
'Andr\xc3\xa9'
Now it is a byte string that can be decoded correctly with utf8:
>>> 'Andr\xc3\xa9'.decode('utf8')
u'Andr\xe9'
>>> print 'Andr\xc3\xa9'.decode('utf8')
André
In one step:
>>> print u'Andr\xc3\xa9'.encode('latin1').decode('utf8')
André
value_uni.encode('utf8') or whatever encoding you need.
See http://docs.python.org/library/stdtypes.html#str.encode
The OP is not converting to ascii nor utf-8. That's why the suggested encode methods won't work. Try this:
v = u'Andr\xc3\xa9'
s = ''.join(map(lambda x: chr(ord(x)),v))
The chr(ord(x)) business gets the numeric value of each unicode character (which had better fit in one byte for your application) and turns it back into a one-byte character; the ''.join call is an idiom that assembles those characters into an ordinary string. No doubt there is a more elegant way.
Simplified explanation: the str type can only hold characters in the 0-255 range. If you want to store unicode (which can contain characters from a much wider range) in a str, you first have to encode the unicode to a format suitable for str, for example UTF-8.
To do this, call the encode method on your unicode object and pass the desired encoding as an argument, for example this_is_str = value_uni.encode('utf-8').
You can read longer and more in-depth (and language agnostic) article on Unicode handling here: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Another excellent article (this time Python-specific): Unicode HOWTO
It seems like
str(value_uni)
should work... at least, it did when I tried it.
EDIT: Turns out that this only works because my system's default encoding is, as far as I can tell, ISO-8859-1 (Latin-1). So for a platform-independent version of this, try
value_uni.encode('latin1')

How do I convert a file's format from Unicode to ASCII using Python?

I use a 3rd party tool that outputs a file in Unicode format. However, I prefer it to be in ASCII. The tool does not have settings to change the file format.
What is the best way to convert the entire file format using Python?
You can convert the file easily enough just using the unicode function, but you'll run into problems with Unicode characters without a straight ASCII equivalent.
This blog recommends the unicodedata module, which seems to take care of roughly converting characters without direct corresponding ASCII values, e.g.
>>> title = u"Klüft skräms inför på fédéral électoral große"
is typically converted to
Klft skrms infr p fdral lectoral groe
which is pretty wrong. However, using the unicodedata module, the result can be much closer to the original text:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
I think this is a deeper issue than you realize. Simply changing the file from Unicode into ASCII is easy, however, getting all of the Unicode characters to translate into reasonable ASCII counterparts (many letters are not available in both encodings) is another.
This Python Unicode tutorial may give you a better idea of what happens to Unicode strings that are translated to ASCII: http://www.reportlab.com/i18n/python_unicode_tutorial.html
Here's a useful quote from the site:
Python 1.6 also gets a "unicode" built-in function, to which you can specify the encoding:
>>> unicode('hello')
u'hello'
>>> unicode('hello', 'ascii')
u'hello'
>>> unicode('hello', 'iso-8859-1')
u'hello'
All three of these return the same thing, since the characters in 'Hello' are common to all three encodings.
Now let's encode something with a European accent, which is outside of ASCII. What you see at a console may depend on your operating system locale; Windows lets me type in ISO-Latin-1.
>>> a = unicode('André', 'latin-1')
>>> a
u'Andr\202'
If you can't type an acute letter e, you can enter the string 'Andr\202', which is unambiguous.
Unicode supports all the common operations such as iteration and splitting. We won't run over them here.
By the way, there is a Linux command, iconv, to do this kind of job:
iconv -f utf8 -t ascii <input.txt >output.txt
Here's some simple (and stupid) code to do encoding translation. I'm assuming (but you shouldn't) that the input file is in UTF-16 (Windows calls this simply 'Unicode').
input_codec = 'UTF-16'
output_codec = 'ASCII'

unicode_file = open('filename')
unicode_data = unicode_file.read().decode(input_codec)

ascii_file = open('new filename', 'w')
ascii_file.write(unicode_data.encode(output_codec))
Note that this will not work if there are any characters in the Unicode file that are not also ASCII characters. You can do the following to turn unrecognized characters into '?'s:
ascii_file.write(unicode_data.encode(output_codec, 'replace'))
Check out the docs for more simple choices. If you need to do anything more sophisticated, you may wish to check out The UNICODE Hammer at the Python Cookbook.
Like this:
uc = open(filename).read().decode('utf8')
ascii = uc.encode('ascii')
Note, however, that this will fail with a UnicodeEncodeError exception if there are any characters that can't be converted to ASCII.
EDIT: As Pete Karl just pointed out, there is no one-to-one mapping from Unicode to ASCII, so some characters simply can't be converted in an information-preserving way. Moreover, standard ASCII is a subset of UTF-8, so ASCII-only data doesn't really even need any conversion.
For my problem, where I just wanted to skip the non-ASCII characters and output only ASCII, the solution below worked really well:
import unicodedata
input = open(filename).read().decode('UTF-16')
output = unicodedata.normalize('NFKD', input).encode('ASCII', 'ignore')
It's important to note that there is no 'Unicode' file format. Unicode can be encoded to bytes in several different ways. Most commonly UTF-8 or UTF-16. You'll need to know which one your 3rd-party tool is outputting. Once you know that, converting between different encodings is pretty easy:
in_file = open("myfile.txt", "rb")
out_file = open("mynewfile.txt", "wb")
in_byte_string = in_file.read()
unicode_string = in_byte_string.decode('UTF-16')
out_byte_string = unicode_string.encode('ASCII')
out_file.write(out_byte_string)
out_file.close()
As noted in the other replies, you're probably going to want to supply an error handler to the encode method. Using 'replace' as the error handler is simple, but will mangle your text if it contains characters that cannot be represented in ASCII.
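A quick sketch of the two most common error handlers ('replace' substitutes '?', 'ignore' drops the character; Python 3 syntax):
u = u'Klüft skräms'
print(u.encode('ascii', 'replace'))  # b'Kl?ft skr?ms'
print(u.encode('ascii', 'ignore'))   # b'Klft skrms'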
As other posters have noted, ASCII is a subset of unicode.
However, if you:
- have a legacy app
- don't control the code for that app
- are sure your input falls into the ASCII subset
then the example below shows how to do it:
>>> mystring = u'bar'
>>> type(mystring)
<type 'unicode'>
>>> myasciistring = mystring.encode('ASCII')
>>> type(myasciistring)
<type 'str'>
