Struggling with utf-16 encoding/decoding - python

I'm parsing a document that has some UTF-16 encoded strings.
I have a byte string that contains the following:
my_var = b'\xc3\xbe\xc3\xbf\x004\x004\x000\x003\x006\x006\x000\x006\x00-\x001\x000\x000\x003\x008\x000\x006\x002\x002\x008\x005'
When converting to utf-8, I get the following output:
print(my_var.decode('utf-8'))
#> þÿ44036606-10038062285
The first two chars, þÿ, indicate a BOM for UTF-16BE, as described on Wikipedia.
But, what I don't understand is that if I try the UTF16 BOM like this:
if value.startswith(codecs.BOM_UTF16_BE)
This returns false. In fact, printing codecs.BOM_UTF16_BE doesn't show the same results:
print(codecs.BOM_UTF16_BE)
#> b'\xfe\xff'
Why is that? I'm suspecting some encoding issues on the higher end but not sure how to fix that one.
There are already a few mentions of how to decode UTF-16 on Stackoverflow (like this one), and they all say one thing: Decode using utf-16 and Python will handle the BOM.
... But that doesn't work for me.
print(my_var.decode('utf-16'))
#> 뻃뿃㐀㐀 ㌀㘀㘀 㘀ⴀ㄀  ㌀㠀 㘀㈀㈀㠀㔀
But with UTF-16BE:
print(my_var.decode('utf-16be'))
#> 쎾쎿44036606-10038062285
(the bom is not removed)
And with UTF-16LE:
print(my_var.decode('utf-16le'))
#> 뻃뿃㐀㐀 ㌀㘀㘀 㘀ⴀ㄀  ㌀㠀 㘀㈀㈀㠀㔀
So, for a reason I can't explain, using only .decode('UTF-16') doesn't work for me. Why?
UPDATE
The original source string isn't the one I mentioned, but this one:
source = '\376\377\0004\0004\0000\0003\0006\0006\0000\0006\000-\0001\0000\0000\0003\0008\0000\0006\0002\0002\0008\0005'
I converted it using the following:
def decode_8bit(cls, match):
    value = match.group().replace(b'\\', b'')
    return chr(int(value, base=8)).encode('utf-8')

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)
Maybe I did something wrong here?

It is right that þÿ indicates the BOM for UTF-16BE, but only if you interpret the bytes using the CP1252 encoding.
The difference is the following:
Your first byte is 0xC3, which is 11000011 in binary.
UTF-8:
The first two bits are set, which indicates that the UTF-8 character is 2 bytes long.
So your first character consists of the bytes 0xC3 0xBE, which decode to þ in UTF-8.
CP1252
CP1252 characters are always 1 byte long, and 0xC3 maps to Ã.
But if you look up 0xC3 in your linked BOM list you won't find any matching encoding.
Looks like there wasn't a BOM in the first place.
Using the default encoding is probably the way to go, which is UTF-16LE for Windows.
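To make the byte-level difference concrete, here is a quick illustrative REPL session (Python 3):
>>> b'\xc3\xbe'.decode('utf-8')    # two UTF-8 bytes decode to one character
'þ'
>>> b'\xc3'.decode('cp1252')       # CP1252 maps every single byte to a character
'Ã'
>>> b'\xfe\xff'.decode('cp1252')   # the real UTF-16BE BOM bytes, viewed through CP1252
'þÿ'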
Edit after original source added
Your encoding to UTF-8 destroys the BOM because it is not valid UTF-8. Try to avoid decoding and pass on a list of bytes.
OP's solution (with a one-element list so that a single byte is produced):
bytes([int(value, base=8)])
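For completeness, the already-mangled my_var from the question can also be repaired, because the accidental UTF-8 step turned every original byte into a single code point below U+0100. A minimal sketch:
# Undo the accidental UTF-8 encoding, then decode the real UTF-16 data.
repaired = my_var.decode('utf-8').encode('latin-1')  # back to b'\xfe\xff\x004\x004...'
print(repaired.decode('utf-16'))                     # BOM handled: 44036606-10038062285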

As requested by @Tomalak and @Hyarus, here's the reason for my issue:
When decoding the 8bit values, I was returning them as UTF-8 encoded:
def decode_8bit(cls, match):
    value = match.group().replace(b'\\', b'')
    return chr(int(value, base=8)).encode('utf-8')

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)
This was messing up the returned data, since it was not encoded using UTF-8 (duh). So the correct code should have been:
def decode_8bit(cls, match):
    value = match.group().replace(b'\\', b'')
    return bytes([int(value, base=8)])  # a one-element list produces a single byte

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)
Hope that helps someone else... Good luck with encoding! :/

Related

Python: Decoding byte array

Can someone tell me how to decode this bytearray:
bytearray(b"~QF\xc2\x13\x04\xca=`\xc3\xef\xfd\x8bc\xb3\xcf\'V\xa3n\xeb\xccX\x97\xb2\xe6\xe0\xca\x12\x85\xac\x8b\xca\xd3hj\xb3\xc0\xedF\x84\x91\x83\t\xfe:\xdd#\xeb?O\xf8\xc2\xf1,\x8d\xfc\x1ag\xddZ\xd0\x14\xcc")
I've tried everything I can think of. Different ways of value.decode() where I used different encodings (utf-8, utf-16-le, windows-1252, ascii, the list goes on) and base64 encodings (last I tried was base64.b64decode since it truncates the padding).
I have an example of the decoding in a NodeJS application:
Buffer.from(bluetoothBarrier.getReceivedMessage()).toString('base64')
But I don't know how to replicate this in Python...
Buffer.from(bluetoothBarrier.getReceivedMessage()).toString('base64')
This appears to be converting the received message into a base64 string.
If that's all you want to do the correct implementation would be:
import base64
base64_bytes = base64.b64encode(bytesarray)   # bytesarray holds the received bytes
base64_string = base64_bytes.decode('utf-8')
This produces the base64-encoded string:
'flFGwhMEyj1gw+/9i2OzzydWo27rzFiXsubgyhKFrIvK02hqs8DtRoSRgwn+Ot1A6z9P+MLxLI38GmfdWtAUzA=='
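For reference, a minimal self-contained sketch applying this to the bytearray from the question:
import base64

data = bytearray(b"~QF\xc2\x13\x04\xca=`\xc3\xef\xfd\x8bc\xb3\xcf\'V\xa3n\xeb\xccX\x97\xb2\xe6\xe0\xca\x12\x85\xac\x8b\xca\xd3hj\xb3\xc0\xedF\x84\x91\x83\t\xfe:\xdd#\xeb?O\xf8\xc2\xf1,\x8d\xfc\x1ag\xddZ\xd0\x14\xcc")

# b64encode accepts any bytes-like object, including a bytearray.
print(base64.b64encode(data).decode('ascii'))  # prints the base64 string shown above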

Python, Encoding output to UTF-8

I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try to x.write(string) I get the UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do `print(u'something')`. But I need to use a variable, and the quotation marks in u'_' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially, all this is supposed to be doing is building me a large number of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open() (going by your added code), and after a bit of looking things up myself, I suggest opening the input and/or output file with the encoding "utf-8-sig", which automatically handles the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that only matters for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM, since Python's default UTF-8 codec interprets BOMs as regular characters, so the input would not raise an issue but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
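As a rough sketch of that suggestion (the file names are only placeholders for the OP's actual paths), the pattern would look something like this in Python 2:
# -*- coding: utf-8 -*-
import codecs

# 'utf-8-sig' transparently strips a UTF-8 BOM when reading.
source = codecs.open("actionbreak/input.csv", "r", "utf-8-sig")
outTarget = codecs.open("actionbreak/output.csv", "w+", "utf-8")

for line in source:          # each line is already a unicode object
    outTarget.write(line)    # codecs.open encodes it to UTF-8 on write

source.close()
outTarget.close()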
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes and the second contains Unicode code points. It is easy to tell which type a string is: a unicode string literal starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' chars are the same because they are in the ASCII range. \u0430 is a Unicode code point; it is outside the ASCII range. A "code point" is Python's internal representation of a Unicode character, and code points can't be saved to a file as-is. They need to be encoded to bytes first. Here is what the encoded unicode string looks like (once it is encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string can now be written to a file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
... f.write(s.encode('utf8'))
Now, it is important to remember what encoding we used when writing to the file, because to read the data back we need to decode it. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
... content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got the encoded bytes, exactly the same as from s.encode('utf8'). To decode them, we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got back our unicode string with its unicode code points.
>>> print content.decode('utf8')
abc абв
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM, or byte order mark, and basically it's a callback to the early days of unicode when people couldn't agree which way round they wanted their bytes to go. Unicode documents are often prefaced with a \ufeff; if you read one with the byte order swapped, it shows up as \ufffe instead.
If you hit an error on those characters in the first location you can be sure the issue is that you are not trying to decode it as utf-8, and the file is probably still fine.
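A quick illustration (Python 2 REPL) of the BOM behaviour described above, and of the utf-8-sig codec mentioned earlier:
>>> u'\ufeffhello'.encode('utf-8')          # the BOM becomes the bytes EF BB BF
'\xef\xbb\xbfhello'
>>> '\xef\xbb\xbfhello'.decode('utf-8')     # plain utf-8 keeps the BOM as a character
u'\ufeffhello'
>>> '\xef\xbb\xbfhello'.decode('utf-8-sig') # utf-8-sig strips it
u'hello'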

string decode method error in python

I have a function like this:
def convert_to_unicode(data):
    row = {}
    if data == None:
        return data
    try:
        for key, val in data.items():
            if isinstance(val, str):
                row[key] = unicode(val.decode('utf8'))
            else:
                row[key] = val
        return row
    except Exception, ex:
        log.debug(ex)
to which I feed a result set (obtained using MySQLdb.cursors.DictCursor) row by row to transform all the string values to unicode (for example {'column_1':'XXX'} becomes {'column_1':u'XXX'}).
Problem is one of the rows has a value like {'column_1':'Gabriel García Márquez'}
and it does not get transformed. It throws this error:
'utf8' codec can't decode byte 0xed in position 12: invalid continuation byte
When I searched for this it seems that this has to do with ascii encoding.
The solutions I tried are:
adding # -*- coding: utf-8 -*- at the beginning of my file ... does not help
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('utf8', 'ignore')) ... as expected it ignores the non-ascii character and returns {'column_1':u'Gabriel Garca Mrquez'}
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('latin-1')) ... does the job, but I am afraid it will support only Western European characters (as per here)
Can anybody point me towards a right direction please.
Firstly:
The data you're getting in your result set is clearly latin-1 encoded, or you wouldn't be observing this behavior. It is entirely correct that trying to decode a latin-1-encoded byte string as though it were utf-8-encoded blows up in your face. Once you have a latin-1-encoded byte string foo, if you want to convert it to the unicode type, foo.decode('latin1') is the right thing to do.
I noticed the expression unicode(val.decode('utf8')) in your code. This is equivalent to just val.decode('utf8'); calling the .decode method of a byte string converts it to unicode, so you're calling unicode() on a unicode string, which just returns the unicode string.
Secondly:
Your real problem here - if you want to be able to deal with characters not included in the character set supported by the latin-1 encoding - is not with Python's string types, per se, so much as it is with the MySQLdb library. I don't know this problem in intimate detail, but as I understand it, in ancient versions of MySQL, the default encoding used by MySQL databases was latin-1, but now it is utf-8 (and has been for many years). The MySQLdb library, however, still by default establishes latin-1-encoded connections with the database. There are literally dozens of StackOverflow questions relating to MySQL, Python, and string encoding, and while I don't fully understand them, one easy-to-use solution to all such problems that seems to work for people is this one:
http://www.dasprids.de/blog/2007/12/17/python-mysqldb-and-utf-8
I wish I could give you a more comprehensive and confident answer on the MySQLdb issue, but I've never even used MySQL and I don't want to risk posting anything untrue. Perhaps someone can come along and provide more detail. Nonetheless, I hope this helps you.
Your third solution - changing the encoding to "latin-1" - is correct. Your input data is encoded as Latin-1, so that's what you have to decode it as. Unless someone somewhere did something very silly, it should be impossible for that input data to contain invalid characters for that encoding.
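Putting both answers together, a minimal Python 2 sketch of the function with the suggested fix (assuming, as the answers conclude, that the connection really is handing back latin-1 bytes):
def convert_to_unicode(data):
    """Decode latin-1 byte strings from a DictCursor row into unicode."""
    if data is None:
        return data
    row = {}
    for key, val in data.items():
        if isinstance(val, str):
            row[key] = val.decode('latin-1')  # no need to wrap this in unicode()
        else:
            row[key] = val
    return row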

UnicodeDecodeError when using a Python string handling function

I'm doing this:
word.rstrip(s)
Where word and s are strings containing unicode characters.
I'm getting this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
There's a bug report where this error happens on some Windows Django systems. However, my situation seems unrelated to that case.
What could be the problem?
EDIT: The code is like this:
def Strip(word):
    for s in suffixes:
        return word.rstrip(s)
The issue is that s is a bytestring, while word is a unicode string - so, Python tries to turn s into a unicode string so that the rstrip makes sense. The problem is, it assumes s is encoded in ASCII, which it clearly isn't (since it contains a character outside the ASCII range).
So, since you initialise it as a literal, it is very easy to turn it into a unicode string by putting a u in front of it:
suffixes = [u'ি']
Will work. As you add more suffixes, you'll need the u in front of all of them individually.
I guess this happens because of implicit conversion in python2.
It's explained in this document, but I recommend reading the whole presentation about handling unicode in Python 2 and 3 (and why Python 3 is better ;-))
So, I think the solution to your problem would be to force the decoding of the strings as utf8 before stripping.
Something like:
def Strip(word):
    word = word.decode("utf8")
    for s in suffixes:
        return word.rstrip(s.decode("utf8"))
Second try:
def Strip(word):
    if type(word) == str:
        word = word.decode("utf8")
    for s in suffixes:
        if type(s) == str:
            s = s.decode("utf8")
        return word.rstrip(s)

Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode.
html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)
But I get a UnicodeDecodeError:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
handler.get(*groups)
File "/Users/greg/clounce/main.py", line 55, in get
html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?
>>> u'aあä'.encode('ascii', 'ignore')
'a'
Decode the string you get back, using either the charset in the appropriate meta tag in the response or in the Content-Type header, then encode.
The method encode(encoding, errors) accepts custom handlers for errors. Besides ignore, other built-in handlers include:
>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'
See https://docs.python.org/3/library/stdtypes.html#str.encode
As an extension to Ignacio Vazquez-Abrams' answer
>>> u'aあä'.encode('ascii', 'ignore')
'a'
It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'
You may also want to translate other characters (such as punctuation) to their nearest equivalents, for instance the RIGHT SINGLE QUOTATION MARK unicode character does not get converted to an ascii APOSTROPHE when encoding.
>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"
There are more efficient ways to accomplish this, though. See this question for more details: Where is Python's "best ASCII for this Unicode" database?
2018 Update:
As of February 2018, using compression like gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode like in the original answer with a gzipped response, you'll get an error like this:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte
In order to decode a gzipped response you need to add the following modules (in Python 3):
import gzip
import io
Note: In Python 2 you'd use StringIO instead of io
Then you can parse the content out like this:
from urllib.request import urlopen  # Python 3

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource
This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer using the GzipFile class. After that, the gzipped file can be read into bytes again and decoded into normally readable text.
Original Answer from 2010:
Can we get the actual value used for link?
In addition, we usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xa0'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
While:
html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
Succeeds without error. Do note that "windows-1252" is something I used as an example. I got it from chardet and it had 0.5 confidence that it is right! (well, given a 1-character-long string, what do you expect). You should change that to the actual encoding of the byte string returned from .urlopen().read(), whatever applies to the content you retrieved.
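If you want to automate that guess, here is a small sketch using chardet (assuming it is installed; pip install chardet):
import urllib
import chardet

html = urllib.urlopen(link).read()            # link is the URL from the question
guess = chardet.detect(html)                  # e.g. {'encoding': 'windows-1252', 'confidence': 0.5}
unicode_str = html.decode(guess['encoding'])
encoded_str = unicode_str.encode("utf8")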
Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it's kind of useless to have self.response.out.write(html) as html is not the encoded string from html.encode (if that is what you were originally aiming for).
As Ignacio suggested, check the source webpage for the actual encoding of the returned string from read(). It's either in one of the meta tags or in the Content-Type header of the response. Use that then as the parameter for .decode().
Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).
Use unidecode - it converts weird characters to ASCII instantly, and even converts Chinese to phonetic ASCII.
$ pip install unidecode
then:
>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'
I use this helper function throughout all of my projects. If it can't convert the unicode, it ignores it. This ties into a django library, but with a little research you could bypass it.
from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')
I no longer get any unicode errors after using this.
For broken consoles like cmd.exe and HTML output you can always use:
my_unicode_string.encode('ascii','xmlcharrefreplace')
This will preserve all the non-ascii chars while making them printable in pure ASCII and in HTML.
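For example (a quick illustrative snippet, Python 2 REPL):
>>> u'Café naïve'.encode('ascii', 'xmlcharrefreplace')
'Caf&#233; na&#239;ve'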
WARNING: If you use this in production code to avoid errors then most likely there is something wrong in your code. The only valid use case for this is printing to a non-unicode console or easy conversion to HTML entities in an HTML context.
And finally, if you are on windows and use cmd.exe then you can type chcp 65001 to enable utf-8 output (works with Lucida Console font). You might need to add myUnicodeString.encode('utf8').
You wrote """I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere."""
The HTML is NOT expected to contain any kind of "attempt at unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually supplied up front ... look for "charset".
You appear to be assuming that the charset is UTF-8 ... on what grounds? The "\xA0" byte that is shown in your error message indicates that you may have a single-byte charset e.g. cp1252.
If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.
Why have you tagged your question with "regex"?
Update after you replaced your whole question with a non-question:
html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.
html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)
If you have a string line, you can use the .encode([encoding], [errors='strict']) method for strings to convert encoding types.
line = 'my big string'
line.encode('ascii', 'ignore')
For more information about handling ASCII and unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html
I think the answer is there, but only in bits and pieces, which makes it difficult to quickly fix a problem such as:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
Let's take an example. Suppose I have a file which has some data in the following form (containing ASCII and non-ASCII chars):
1/10/17, 21:36 - Land : Welcome ��
and we want to drop the non-ASCII characters and preserve only the ASCII ones.
This code will do:
import unicodedata

fp = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii', 'ignore')
    if len(rline) != 0:
        print rline
and type(rline) will give you
>type(rline)
<type 'str'>
unicodestring = '\xa0'
decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')
Works for me
You can use the following piece of code as an example to avoid Unicode to ASCII errors:
from anyascii import anyascii
content = "Base Rent for – CC# 2100 Acct# 8410: $41,667.00 – PO – Lines - for Feb to Dec to receive monthly"
content = anyascii(content)
print(content)
Looks like you are using python 2.x.
Python 2.x defaults to ascii and it doesn’t know about Unicode. Hence the exception.
Just paste the line below after the shebang and it will work:
# -*- coding: utf-8 -*-
