Python Unicode hex string decoding - python

I have the following string: u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' encoded in windows-1255 and I want to decode it into Unicode code points (u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9').
>>> u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
However, if I try to decode the string: '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' I don't get the exception:
>>> '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'
How do I decode the Unicode hex string (the one that gets the exception) or convert it to a regular string that can be decoded?
Thanks for the help.

I have the following string: u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' encoded in windows-1255
That is self-contradictory. The u indicates it is a Unicode string. But if you say it is encoded in whatever, it must be a byte string (because a Unicode string can only be encoded into a byte string).
And indeed, the escapes you show (\xe4, \xe7, etc.) each represent one byte, and they only acquire their intended meaning through the given encoding, windows-1255.
In other words, if you have a u'\xe4', you can be sure it is the same as u'\u00e4' and NOT u'\u05d4', as it would be if the encoding had been applied.
If, by any chance, you got your erroneous Unicode string from a source which is unaware of this problem, you can derive from it the byte string you really need with the help of a "1:1 coding", namely latin1.
So
correct_str = u_str.encode("latin1")
# now every byte of correct_str has the same value as the corresponding code point of u_str (0x00..0xFF)
correct_u_str = correct_str.decode("windows-1255")
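As a minimal sketch of the same round trip in Python 3 syntax (where str is always Unicode and bytes is a separate type), the two steps can be wrapped in a small helper. The name redecode and its parameters are hypothetical, chosen here for illustration only:

```python
def redecode(mis_decoded, actual_encoding):
    """Recover text whose bytes were wrongly read as Latin-1 code points."""
    # latin-1 maps code points 0x00..0xFF to the identical byte values
    raw = mis_decoded.encode("latin-1")
    # now interpret those bytes with the encoding they were really written in
    return raw.decode(actual_encoding)


u_str = "\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9"
correct_u_str = redecode(u_str, "windows-1255")
# ascii() keeps the output printable on any console
print(ascii(correct_u_str))
```

The helper raises UnicodeEncodeError if the input contains any code point above U+00FF, which is a useful signal that the string was not produced by this kind of mix-up.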

That's because \xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9 is a byte array, not a Unicode string: the bytes represent windows-1255 characters rather than the Unicode code points you intend.
Therefore, when you prefix it with u, Python cannot decode the string (it first tries to encode it to bytes with the ascii codec), and on a console whose encoding is ASCII it cannot even print it:
>>> print u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
So, in order to convert your byte array to UTF-8, you will have to decode it as windows-1255 and then encode it to utf-8:
>>> '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255').encode('utf8')
'\xd7\x94\xd7\x97\xd7\x9c\xd7\xa7 \xd7\x94\xd7\xa9\xd7\x9c\xd7\x99\xd7\xa9\xd7\x99'
Which gives the original Hebrew text:
>>> print '\xd7\x94\xd7\x97\xd7\x9c\xd7\xa7 \xd7\x94\xd7\xa9\xd7\x9c\xd7\x99\xd7\xa9\xd7\x99'
החלק השלישי
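For comparison, here is a sketch of the same conversion in Python 3 syntax, where the input must be an explicit bytes literal (Python 3 is an assumption here; the sessions above are Python 2):

```python
raw = b"\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9"   # windows-1255 bytes
text = raw.decode("windows-1255")                    # bytes -> str (Unicode)
utf8_bytes = text.encode("utf-8")                    # str -> UTF-8 bytes

# decoding the UTF-8 bytes gives back the same text
assert utf8_bytes.decode("utf-8") == text
```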

Try this:
>>> u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.encode('latin-1').decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

Decode like this:
>>> b'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

Related

Not able to convert HEX to ASCII in python 3.6.3

I tried below methods, but no luck.
method 1:
var contains the hex value
bytes.fromhex(var).decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdb in position 0: ordinal not in range(128)
method 2:
codecs.decode(var,"hex")
This is returning me in bytes, not in ASCII.
Can someone help on this conversion?
Did you try:
chr(var)
This should give you the character for an ASCII code, provided var is an integer.
Judging from the question, I assume you are using Python 3.x.
The reason for your error is that you are trying to decode as ASCII the byte 0xdb, whose value is above 127.
You just can't do that: ASCII defines no such byte value.
Your options are:
1. Ignore decode errors:
>>> import codecs
>>> u = 'DB91132598CC' # unicode
>>> b = codecs.decode(u,"hex") # bytes
>>> b
b'\xdb\x91\x13%\x98\xcc'
>>> result = b.decode("ascii", errors="ignore") # unicode
>>> result
'\x13%'
2. Use different encoding:
>>> result = b.decode("cp1252") # for example
>>> result
'Û‘\x13%˜Ì'
If you want only ASCII chars in the result use option #1.
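The two options can be compared side by side. This is a sketch assuming Python 3 and using bytes.fromhex in place of codecs.decode; the variable names are illustrative. It also shows latin-1, which maps every byte to some code point and therefore never fails, although whether the result is meaningful depends on what the bytes really are:

```python
raw = bytes.fromhex("DB91132598CC")

ascii_only = raw.decode("ascii", errors="ignore")  # drops every byte >= 0x80
as_cp1252 = raw.decode("cp1252")                    # one guess at the encoding
as_latin1 = raw.decode("latin-1")                   # never raises

# ascii() keeps the output printable on any console
print(ascii(ascii_only))
print(ascii(as_cp1252))
print(ascii(as_latin1))
```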
Have you tried:
codecs.decode(codecs.decode(var,'hex'),'ascii')
>>> var = int('7A', 16) #var is an integer now
>>> chr(var) #int value to char
'z'
This solution works for only one character. You have to split the string into two-digit hex values and convert each one separately.
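One way to do the splitting, sketched in Python 3. The helper name hex_to_chars is hypothetical, and the sketch assumes the input has an even number of hex digits with no separators:

```python
def hex_to_chars(hex_string):
    # take the digits two at a time: '7A6F6F' -> ['7A', '6F', '6F']
    pairs = [hex_string[i:i + 2] for i in range(0, len(hex_string), 2)]
    # convert each pair to an integer, then to the character with that code point
    return "".join(chr(int(pair, 16)) for pair in pairs)


print(hex_to_chars("7A6F6F"))  # zoo
```

For values above 7F this yields the character with that Unicode code point rather than an ASCII character, since ASCII stops at 7F.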
The following code:
codecs.decode(var,"hex")
returns a str in Python 2.7:
'\xdb\x91\x13%\x98\xcc\xbfv\xaef\x8bK\x08Qv\xbb\x19\'u"\x1f\xdb\xb5\x0f\xce\x1c9\'\xc0w\xea\xf1\xe3\xda\xc4\xc8\xa8\xe8\x02\x8c?r\x95\xef\x81W\xce\xd5\x97\xa3n\xf1\xc3\xbf\xa4QG{\xff2\xee\xb1\x80l,\xc0D%\x85\x19z+\xcd,C\x92\x14z\xad\xb90f\xd0\xbaZ\xb6\xdb\xfd?o\xce\xb7\x07:\xe6\x1a]J\xa8\xab\xcb\xcf\xf4\xee\xbd\x1a\x16Uh\x9b\xfd~\xab\x82\xd7{\xf7"Ou\xfb\xcd2<\x9b\x9f\xa9\xc0\xb7\xd7\x99\x18\x08x\xa8\x1d]\x07\xcf\x05\xbe9\xee\xf9\x89\xb2\xfc0w\x99},/\x11b\xe5\xb4}\x99\xe4\xb4\x15\xbc\x8c\xe5\xc7UGi1\xbd\x8e\xd1K_\xce\xc1\xc8\xc6TQYF\xabx`\xbb\xbe\xe7\xdc\xcf\xda\xa7\xaaA\x0f\xf6SR\xb1S\xb5\x87(\xd5x\x14\xc6\x10\xf8%(m\x83\x0c0\x84)\xbd\xcf\x11g\x88{\x12^\xfb/\xa3K=\xea\xcd2\x9fWgL\x07\x1b\xefl\x9c\xea\xc0\xc7\xfa\xbbXz\x1do\x8bM\x0bS'
and bytes in Python 3.6:
b'\xdb\x91\x13%\x98\xcc\xbfv\xaef\x8bK\x08Qv\xbb\x19\'u"\x1f\xdb\xb5\x0f\xce\x1c9\'\xc0w\xea\xf1\xe3\xda\xc4\xc8\xa8\xe8\x02\x8c?r\x95\xef\x81W\xce\xd5\x97\xa3n\xf1\xc3\xbf\xa4QG{\xff2\xee\xb1\x80l,\xc0D%\x85\x19z+\xcd,C\x92\x14z\xad\xb90f\xd0\xbaZ\xb6\xdb\xfd?o\xce\xb7\x07:\xe6\x1a]J\xa8\xab\xcb\xcf\xf4\xee\xbd\x1a\x16Uh\x9b\xfd~\xab\x82\xd7{\xf7"Ou\xfb\xcd2<\x9b\x9f\xa9\xc0\xb7\xd7\x99\x18\x08x\xa8\x1d]\x07\xcf\x05\xbe9\xee\xf9\x89\xb2\xfc0w\x99},/\x11b\xe5\xb4}\x99\xe4\xb4\x15\xbc\x8c\xe5\xc7UGi1\xbd\x8e\xd1K_\xce\xc1\xc8\xc6TQYF\xabx`\xbb\xbe\xe7\xdc\xcf\xda\xa7\xaaA\x0f\xf6SR\xb1S\xb5\x87(\xd5x\x14\xc6\x10\xf8%(m\x83\x0c0\x84)\xbd\xcf\x11g\x88{\x12^\xfb/\xa3K=\xea\xcd2\x9fWgL\x07\x1b\xefl\x9c\xea\xc0\xc7\xfa\xbbXz\x1do\x8bM\x0bS'
The reason is that in Python 2.7
str == bytes #True
while in Python 3.6
str == bytes #False
Python 2 strings are byte strings, while Python 3 strings are Unicode strings. Both results are actually the same, but a byte string in Python 3 is of type bytes, not str, and its literal representation is prefixed with b.
This has nothing to do with ASCII encoding, as none of the output values (regardless of Python version) is ASCII encoded.
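A small sketch of that distinction, assuming Python 3 (the hex value is shortened here for illustration):

```python
raw = bytes.fromhex("db911325")

# the hex decoding step yields bytes, never str
print(type(raw).__name__)

# bytes and str never compare equal, even when the content looks the same
print(b"abc" == "abc")

# turning bytes into str always needs an explicit codec
print(b"abc".decode("ascii") == "abc")
```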
Moreover on Python 2.7 you will also get this error:
codecs.decode(var, 'hex').decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdb in position 0: ordinal not in range(128)
I actually pasted it from a Python 2.7 interpreter, but you can check for yourself.
Your output string can't be decoded with the ascii codec in any version of Python, because it simply is not an ASCII-encoded string.

Convert utf-8 string to cp950 encoding in python

I'm handling an encoding problem.
My input is a unicode string, such as:
>>> s
u'\xa6\xe8\xac\xc9'
Actually it is encoded in cp950. I want to decode it: (notice there's no "u")
>>> print unicode('\xa6\xe8\xac\xc9', 'cp950')
西界
However, I don't know how to get rid of that "u".
Direct conversion is not working:
>>> str(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
The result of using encode() is not what I wanted:
>>> s.encode('utf8')
'\xc2\xa6\xc3\xa8\xc2\xac\xc3\x89'
what I want is '\xa6\xe8\xac\xc9'
This is a bit of an abuse of the unicode type. Characters in a unicode string are expected to be Unicode codepoints (e.g. u'\u897f\u754c'), and thus are encoding-agnostic. They are not supposed to be bytes from a specific encoding (Python 3 makes this distinction very clear by separating Unicode strings str, from byte strings bytes).
Since you want to just interpret each codepoint as bytes, you can do
u'\xa6\xe8\xac\xc9'.encode('iso-8859-1')
since the first 256 codepoints of Unicode are defined to be equal to the codepoints of ISO-8859-1. However, please try to fix the issue that gave you this incorrect Unicode string in the first place.
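A short sketch of why this works, written for Python 3 (an assumption; the question uses Python 2). It checks that every code point below 256 encodes to the single byte with the same value, then applies the round trip to the string from the question:

```python
# each of the first 256 code points encodes to the byte with the same value
assert all(chr(i).encode("iso-8859-1") == bytes([i]) for i in range(256))

s = "\xa6\xe8\xac\xc9"             # bytes misread as code points
raw = s.encode("iso-8859-1")       # the byte string the question asks for
text = raw.decode("cp950")         # the intended two-character string

# ascii() keeps the output printable on any console
print(ascii(text))
```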
So let's get this straight: you have a sequence of bytes that were read in as Unicode codepoints, and you need them to be interpreted as cp950 instead?
>>> ''.join(chr(ord(c)) for c in s)
'\xa6\xe8\xac\xc9'
>>> print ''.join(chr(ord(c)) for c in s).decode('cp950')
西界

What's the difference between these method to deal with Unicode strings in Python?

I tried print a_str.decode("utf-8"), print uni_str, print uni_str.decode("utf-8"), and print uni_str.encode("utf-8").
But only the first one works.
>>> print '\xe8\xb7\xb3'.decode("utf-8")
跳
>>> print u'\xe8\xb7\xb3\xe8'
è·³è
>>> print u'\xe8\xb7\xb3\xe8'.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
>>> print u'\xe8\xb7\xb3\xe8'.encode("utf-8")
è·³è
I'm really confused with how to display a Unicode string normally. If I have a string like this:
a=u'\xe8\xb7\xb3\xe8', how can I print a?
'\xe8\xb7\xb3' is a Chinese character encoded in UTF-8, so '\xe8\xb7\xb3'.decode('utf-8') works fine and returns the Unicode value of 跳, u'\u8df3'. But u'\xe8\xb7\xb3' is a literal Unicode string, which is not the same as the Unicode string for 跳. And a Unicode string cannot be decoded; it is already Unicode.
Finally, a=u'\xe8\xb7\xb3\xe8' is really not a valid unicode string[1].
Where does the u'\xe8\xb7\xb3' come from? Another function?
[1]Check out the first comment.
If you have a string like that then it's broken. You'll need to encode it as Latin-1 to get it to a bytestring with the same byte values, and then decode as UTF-8.
The unicode string u'\xe8\xb7\xb3\xe8' is equivalent to u'\u00e8\u00b7\u00b3\u00e8'. What you want is u'\u8df3' which can be encoded in utf8 as '\xe8\xb7\xb3'.
In Python 2, unicode may be a UCS-2 string (it is a build option). So, u'\xe8\xb7\xb3\xe8' is a string of four 16-bit Unicode characters.
If you got a UTF-8 string (8-bit string) incorrectly presented as Unicode (16-bit string), you have to convert it to an 8-bit string first:
>>> ''.join([chr(ord(a)) for a in u'\xe8\xb7\xb3']).decode('utf8')
u'\u8df3'
Note that '\xe8\xb7\xb3\xe8' is not a valid UTF-8 string, as the last byte '\xe8' is the lead byte of a three-byte sequence and cannot terminate a UTF-8 string.
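A sketch of both points in Python 3 syntax (assumed here; the session above is Python 2): the three-character string is repaired by the Latin-1 round trip, while the four-byte version fails because its final byte starts a sequence that never completes.

```python
broken = "\xe8\xb7\xb3"
fixed = broken.encode("latin-1").decode("utf-8")
print(ascii(fixed))

truncated = b"\xe8\xb7\xb3\xe8"
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that could not be decoded
    bad_index = exc.start
    print("cannot decode byte at index", bad_index)

# dropping the incomplete tail recovers the one complete character
salvaged = truncated.decode("utf-8", errors="ignore")
```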

Convert hash.digest() to unicode

import hashlib
string1 = u'test'
hashstring = hashlib.md5()
hashstring.update(string1)
string2 = hashstring.digest()
unicode(string2)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 1: ordinal not in range(128)
The string HAS to be unicode for it to be any use to me, can this be done?
Using python 2.7 if that helps...
Ignacio just gave the perfect answer. Just one addition: when you convert a string to unicode from an encoding that has characters not found in ASCII, you have to pass the encoding as a parameter:
>>> unicode("órgão")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> unicode("órgão", "UTF-8")
u'\xf3rg\xe3o'
If you cannot say what the original encoding is (UTF-8 in my example), you really cannot convert to Unicode. It is a sign that something is not quite right in what you are trying to do.
Last but not least, encodings are pretty confusing stuff. This comprehensive text about them can make them clear.
The result of .digest() is a bytestring¹, so converting it to Unicode is pointless. Use .hexdigest() if you want a readable representation.
¹ Some bytestrings can be converted to Unicode, but the bytestrings returned by .digest() do not contain textual data. They can contain any byte including the null byte: they're usually not printable without using escape sequences.
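A minimal sketch of the hexdigest approach, written for Python 3 (an assumption; under Python 3 the text must be encoded to bytes before hashing, and UTF-8 is chosen here for illustration):

```python
import hashlib

string1 = "test"
hashstring = hashlib.md5()
hashstring.update(string1.encode("utf-8"))   # hash functions consume bytes

raw_digest = hashstring.digest()       # 16 arbitrary bytes, not text
hex_digest = hashstring.hexdigest()    # 32 hex digits as an ordinary str

print(hex_digest)
```

The hex form contains only the characters 0-9 and a-f, so it can be stored or displayed as text without any further decoding step.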

Convert or strip out "illegal" Unicode characters

I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.
However for some characters, it explodes. I get complaints like this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)
Is there some way I can convert the chars to proper unicode versions? Or strip them out?
Once you have the string of bytes s, instead of using it as a unicode obj directly, convert it explicitly with the right codec, e.g.:
u = s.decode('latin-1')
and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That's assuming latin-1 is the encoding that was used to make the byte string originally -- it's impossible for us to guess, so try to find out;-).
As a general rule, I suggest: don't process in your applications any text as encoded byte strings -- decode them to unicode objects right after input, and, if necessary, encode them back to byte strings right before output.
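A sketch of that rule in Python 3 syntax. The function names, the sample bytes, and the choice of cp1252 as the source encoding are assumptions for illustration; the real encoding of the MSSQL column has to be determined separately:

```python
def from_source(raw, encoding="cp1252"):
    """Decode bytes to text as soon as they enter the program."""
    return raw.decode(encoding)


def to_sink(text, encoding="utf-8"):
    """Encode text to bytes only when it leaves the program."""
    return text.encode(encoding)


raw = b"caf\xe9 \x97 bar"          # hypothetical bytes read from the database
text = from_source(raw)            # all processing happens on text
stored = to_sink(text.upper())     # bytes again, just before output
```

Keeping the two conversions at the edges means the code in between never has to know which encoding the data arrived in.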
When you decode, just pass 'ignore' to strip those characters.
There are some other ways of stripping or converting them:
'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd'
'ignore': ignore malformed data and continue without further notice
'backslashreplace': replace with backslashed escape sequences (for encoding only)
Test
>>> "abcd\x97".decode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
>>>
>>> "abcd\x97".decode("ascii","ignore")
u'abcd'
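The handlers can be compared on the same input. This sketch assumes Python 3.5 or later, where 'backslashreplace' is also accepted when decoding (in the Python 2 session above it applies to encoding only):

```python
data = b"abcd\x97"

for handler in ("ignore", "replace", "backslashreplace"):
    # ascii() keeps the output printable on any console
    print(handler, ascii(data.decode("ascii", handler)))
```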
