I'm trying to get bytes from a png file in python 3, and print a string showing the bytes from the png file. However, it gives me this output:
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00(\x00\x00\x00(\x08\x02\x00\x00\x00\x03\x9c/:\x00\x00\x00\x01sRGB\x00\xae\xce\x1c\xe9\x00\x00\x00\x04gAMA\x00\x00\xb1\x8f\x0b\xfca\x05\x00\x00\x00\tpHYs\x00\x00\x0e\xc3\x00\x00\x0e\xc3\x01\xc7o\xa8d\x00\x00\x01XIDATXG\xe5\xcd\xb1m\x031\x14\x04\xd1\xebF\xad\xb8+\xd5\xe0\x8a\xe5`f\x19|,.\xa0\x0fL\xf4\xc0h\x08.\xafo\xf5>\xc8/a;\xc2/a;\xc2/a;\xc2/a;\xc2/a\x0b\xebC\x1c\r+la}\x88\xa3a\x85-\x88\xbf?\xff=p4\xac\xb0\x05q\xacl\x1c8\x1aV\xd8\x828V6\x0e\x1c\r+lA\x1c+\x1b\x07\x8e\x86\x15\xb6 \x8e\x95\x8d\x03G\xc3\n[\x10\xeb\xca\xbd\xfa\xc4\xd1\xb0\xc2\x16\xc4\xbar\xaf>q4\xac\xb0\x05\xb1\xae|\xde\xafz\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xf7\xea\x13G\xc3\n[\x10\xeb\xca\xbd\xfa\xc4\xd1\xb0\xc2\x16\xc4\xb1\xb2q\xe0hXa\x0b\xe2X\xd98p4\xac\xb0\x05q\xacl\x1c8\x1aV\xd8\x828V6\x0e\x1c\r+lA\x1c+\x1b\x07\x8e\x86\x15\xb6\xb0>\xc4\xd1\xb0\xc2\x16\xd6\x878\x1aV\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0\xcb/s]\x7f\xf8o$|7\xc4\xdf\xeb\x00\x00\x00\x00IEND\xaeB`\x82'
instead of normal bytes (here are the bytes it should show): 89504E470D0A1A0A0000000D4948445200000028000000280802000000039C2F3A000000017352474200AECE1CE90000000467414D410000B18F0BFC6105000000097048597300000EC300000EC301C76FA86400000158494441545847E5CDB16D03311404D1EB46ADB82BD5E08AE56066197C2C2EA00F4CF4C068082EAF6FF53EC82F613BC22F613BC22F613BC22F613BC22F610BEB431C0D2B6C617D88A361852D88BF3FFF3D7034ACB00571AC6C1C381A56D8823856360E1C0D2B6C411C2B1B078E8615B6208E958D0347C30A5B10EBCABDFAC4D1B0C216C4BA72AF3E7134ACB005B1AE7CDEAF7AB8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BF7EA1347C30A5B10EBCABDFAC4D1B0C216C4B1B271E06858610BE258D9387034ACB00571AC6C1C381A56D8823856360E1C0D2B6C411C2B1B078E8615B6B03EC4D1B0C216D687381A56D88EF04BD88EF04BD88EF04BD88EF04BD88EF0CB2F735D7FF86F247C37C4DFEB0000000049454E44AE426082
Here is the code that I wrote to do this:
fileread = input("Input File: ")
with open(fileread, 'rb') as readfile:
string = str(readfile.read())
readfile.close()
print("String: "+string)
newstr = str(bytes(string, 'utf-8').decode('utf-8'))
Can anyone help me?
You've got it right. It's just showing the ASCII representation of the data as that's usually the more useful form
>>> s = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00(\x00\x00\x00(\x08\x02\x00\x00\x00\x03\x9c/:\x00\x00\x00\x01sRGB\x00\xae\xce\x1c\xe9\x00\x00\x00\x04gAMA\x00\x00\xb1\x8f\x0b\xfca\x05\x00\x00\x00\tpHYs\x00\x00\x0e\xc3\x00\x00\x0e\xc3\x01\xc7o\xa8d\x00\x00\x01XIDATXG\xe5\xcd\xb1m\x031\x14\x04\xd1\xebF\xad\xb8+\xd5\xe0\x8a\xe5`f\x19|,.\xa0\x0fL\xf4\xc0h\x08.\xafo\xf5>\xc8/a;\xc2/a;\xc2/a;\xc2/a;\xc2/a\x0b\xebC\x1c\r+la}\x88\xa3a\x85-\x88\xbf?\xff=p4\xac\xb0\x05q\xacl\x1c8\x1aV\xd8\x828V6\x0e\x1c\r+lA\x1c+\x1b\x07\x8e\x86\x15\xb6 \x8e\x95\x8d\x03G\xc3\n[\x10\xeb\xca\xbd\xfa\xc4\xd1\xb0\xc2\x16\xc4\xbar\xaf>q4\xac\xb0\x05\xb1\xae|\xde\xafz\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xf7\xea\x13G\xc3\n[\x10\xeb\xca\xbd\xfa\xc4\xd1\xb0\xc2\x16\xc4\xb1\xb2q\xe0hXa\x0b\xe2X\xd98p4\xac\xb0\x05q\xacl\x1c8\x1aV\xd8\x828V6\x0e\x1c\r+lA\x1c+\x1b\x07\x8e\x86\x15\xb6\xb0>\xc4\xd1\xb0\xc2\x16\xd6\x878\x1aV\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0\xcb/s]\x7f\xf8o$|7\xc4\xdf\xeb\x00\x00\x00\x00IEND\xaeB`\x82'
>>> s[0]
137
>>> s[1]
80
>>> s[2]
78
>>> hex(s[0])
'0x89'
>>> hex(s[1])
'0x50'
>>> hex(s[2])
'0x4e'
>>>
I don't think you'd need the UTF-8 decode step as this is just binary data right?
If you actually want an ASCII representation of the data in hex form to match what you have in the question you could use
>>> ''.join('%02x' % c for c in s)
'89504e470d0a1a0a0000000d4948445200000028000000280802000000039c2f3a000000017352474200aece1ce90000000467414d410000b18f0bfc6105000000097048597300000ec300000ec301c76fa86400000158494441545847e5cdb16d03311404d1eb46adb82bd5e08ae56066197c2c2ea00f4cf4c068082eaf6ff53ec82f613bc22f613bc22f613bc22f613bc22f610beb431c0d2b6c617d88a361852d88bf3fff3d7034acb00571ac6c1c381a56d8823856360e1c0d2b6c411c2b1b078e8615b6208e958d0347c30a5b10ebcabdfac4d1b0c216c4ba72af3e7134acb005b1ae7cdeaf7ab8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2bf7ea1347c30a5b10ebcabdfac4d1b0c216c4b1b271e06858610be258d9387034acb00571ac6c1c381a56d8823856360e1c0d2b6c411c2b1b078e8615b6b03ec4d1b0c216d687381a56d88ef04bd88ef04bd88ef04bd88ef04bd88ef0cb2f735d7ff86f247c37c4dfeb0000000049454e44ae426082'
You're getting the bytes fine; you just want to print them differently from the default Python method (which uses characters for printable ASCII codes so you can read them more easily). Just iterate over the bytes and format them however you like:
for byte in string:
print(("%02x" % byte).upper(), end="")
If the file isn't too large, you could also do it with one print() call by doing the formatting all at once and printing that:
print("".join(("%02x" % byte).upper() for byte in string))
This will build a string using approximately 6 times the amount of memory as your file before printing it. Use the first method if this could be a problem.
Actually, I just remembered... Python has a module for this!
from binascii import hexlify
print(hexlify(string).upper())
This will actually use even more memory, since it converts the letters in the hex string to uppercase after building it, but if you're OK with lowercase letters in your hex, this is probably the best solution.
BTW, it's advisable not to call what you read from your file string; it's binary data, not text.
Related
I'm working on a project in which I have to perform some byte operations using python and I'd like to understand some basic principals before I go on with it.
t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
print("Adding b character before: ",t1)
print("Using bytes(str): ",bytes(t2,"utf-8"))
print("Using str.encode: ",t2.encode())
In particular, I cannot understand why the console prints this when I run the code above:
C:\Users\Marco\PycharmProjects\codeTest\venv\Scripts\python.exe C:/Users/Marco/PycharmProjects/codeTest/msgPack/temp.py
Adding b character before: b'\xacBLETCHINGLEY'
Using bytes(str): b'\xc2\xacBLETCHINGLEY'
Using str.encode: b'\xc2\xacBLETCHINGLEY'
What I would like to understand is why, if I use bytes() or decode, I get an extra "\xc2" in front of the value. What does it mean? Is this supposed to appear? And if so, how can I get rid of it without using the first method?
Because bytes objects and str objects are two different things. The former represents a sequence of bytes, the latter represents a sequence of unicode code points. There's a huge difference between the byte 172 and the unicode code point 172.
In particular, the byte 172 doesn't encode anything in particular in unicode. On the other hand, unicode code point 172 refers to the following character:
>>> c = chr(172)
>>> print(c)
¬
And of course, they actual raw bytes this would correspond to depend on the encoding. Using utf-8 it is a two-byte encoding:
>>> c.encode()
b'\xc2\xac'
In the latin-1 encoding, it is a 1 byte:
>>> c.encode('latin')
b'\xac'
If you want raw bytes, the most precise/easy way then is to use a bytes-literal.
In a string literal, \xhh (h being a hex digit) selects the corresponding unicode character U+0000 to U+00FF, with U+00AC being the ¬ "not sign". When encoding to utf-8, all code points above 0x7F take two or more bytes. \xc2\xac is the utf-8 encoding of U+00AC.
>>> "\u00AC" == "\xAC"
True
>>> "\u00AC" == "¬"
True
>>> "\xAC" == "¬"
True
>>> "\u00AC".encode('utf-8')
b'\xc2\xac'
>>> "¬".encode("utf-8")
b'\xc2\xac'
I should not expect any error here. I just want to take the string literaly and translate it into its bytes. I don't want to encode or decode anything.
I am taking here a stupid example:
>>> astring
u'\xb0'
Stupid enough to give me headache...
>>> bytes(astring)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position...
One horrible trick is to do this:
>>> bytes(repr(astring)[2:-1])
'\xb0'
One other bad solution is:
>>> bytes(astring.encode("utf-8"))
'\xc2\xb0'
It is a bad solution because my string is not composed of two chars. This is wrong.
Another awful solution would be:
>>> bytes(''.join(map(bytes, [chr(ord(c)) for c in astring])))
'\xb0'
I am using Python 2.7
Background
I would like to compare two columns on a database where the encoding is unknown and sometime conflicting. I don't care about wrong chars on my dump. I just want to get it to have a look at it.
If your Unicode strings are guaranteed to only contain codepoints < 256 then you can convert them to bytes using the Latin1 encoding. Here's some Python 2 code that performs this conversion on all codepoints in range(256).
r = range(256)
s = u''.join([unichr(i) for i in r])
print repr(s)
b = s.encode('latin1')
print repr(b)
a = [ord(c) for c in b]
print a == r
output
u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
FWIW, here's the equivalent Python 3 code.
r = range(256)
s = u''.join([chr(i) for i in r])
print(repr(s))
b = s.encode('latin1')
print(repr(b))
print(list(b) == list(r))
output
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
Note that the Python 3 Unicode repr output is a little more human-friendly.
You cannot just 'take the string literally' because the actual, internal, bytes representation of your string is not fixed and is an implementation detail of the your python interpreter that your should not rely on (see PEP3993, on the same system different string can use different internal encoding).
That also means that to get a byte representation of you string, you really need to encode it, and thus specify the encoding.
By the way, astring.encode("utf-8") is not wrong (and already returns a bytes, you don't need the extra bytes(...) in your code), as in utf-8 a single character can be represented as several bytes.
You should be able to just add b before the quotes of the string.
>>> astring = b'\xb0'
>>> astring
b'\xb0'
>>> bytes(astring)
b'\xb0'
>>>
Putting b before the string makes it a bytes object. No more UnicodeEncodeError.
I want to encode string to bytes.
To convert to byes, I used byte.fromhex()
>>> byte.fromhex('7403073845')
b't\x03\x078E'
But it displayed some characters.
How can it be displayed as hex like following?
b't\x03\x078E' => '\x74\x03\x07\x38\x45'
I want to encode string to bytes.
bytes.fromhex() already transforms your hex string into bytes. Don't confuse an object and its text representation -- REPL uses sys.displayhook that uses repr() to display bytes in ascii printable range as the corresponding characters but it doesn't affect the value in any way:
>>> b't' == b'\x74'
True
Print bytes to hex
To convert bytes back into a hex string, you could use bytes.hex method since Python 3.5:
>>> b't\x03\x078E'.hex()
'7403073845'
On older Python version you could use binascii.hexlify():
>>> import binascii
>>> binascii.hexlify(b't\x03\x078E').decode('ascii')
'7403073845'
How can it be displayed as hex like following? b't\x03\x078E' => '\x74\x03\x07\x38\x45'
>>> print(''.join(['\\x%02x' % b for b in b't\x03\x078E']))
\x74\x03\x07\x38\x45
The Python repr can't be changed. If you want to do something like this, you'd need to do it yourself; bytes objects are trying to minimize spew, not format output for you.
If you want to print it like that, you can do:
from itertools import repeat
hexstring = '7403073845'
# Makes the individual \x## strings using iter reuse trick to pair up
# hex characters, and prefixing with \x as it goes
escapecodes = map(''.join, zip(repeat(r'\x'), *[iter(hexstring)]*2))
# Print them all with quotes around them (or omit the quotes, your choice)
print("'", *escapecodes, "'", sep='')
Output is exactly as you requested:
'\x74\x03\x07\x38\x45'
I have a string in python 3 that has several unicode representations in it, for example:
t = 'R\\u00f3is\\u00edn'
and I want to convert t so that it has the proper representation when I print it, ie:
>>> print(t)
Róisín
However I just get the original string back. I've tried re.sub and some others, but I can't seem to find a way that will change these characters without having to iterate over each one.
What would be the easiest way to do so?
You want to use the built-in codec unicode_escape.
If t is already a bytes (an 8-bit string), it's as simple as this:
>>> print(t.decode('unicode_escape'))
Róisín
If t has already been decoded to Unicode, you can to encode it back to a bytes and then decode it this way. If you're sure that all of your Unicode characters have been escaped, it actually doesn't matter what codec you use to do the encode. Otherwise, you could try to get your original byte string back, but it's simpler, and probably safer, to just force any non-encoded characters to get encoded, and then they'll get decoded along with the already-encoded ones:
>>> print(t.encode('unicode_escape').decode('unicode_escape')
Róisín
In case you want to know how to do this kind of thing with regular expressions in the future, note that sub lets you pass a function instead of a pattern for the repl. And you can convert any hex string into an integer by calling int(hexstring, 16), and any integer into the corresponding Unicode character with chr (note that this is the one bit that's different in Python 2—you need unichr instead). So:
>>> re.sub(r'(\\u[0-9A-Fa-f]+)', lambda matchobj: chr(int(matchobj.group(0)[2:], 16)), t)
Róisín
Or, making it a bit more clear:
>>> def unescapematch(matchobj):
... escapesequence = matchobj.group(0)
... digits = escapesequence[2:]
... ordinal = int(digits, 16)
... char = chr(ordinal)
... return char
>>> re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, t)
Róisín
The unicode_escape codec actually handles \U, \x, \X, octal (\066), and special-character (\n) sequences as well as just \u, and it implements the proper rules for reading only the appropriate max number of digits (4 for \u, 8 for \U, etc., so r'\\u22222' decodes to '∢2' rather than '𢈢'), and probably more things I haven't thought of. But this should give you the idea.
First of all, it is rather confused what you what to convert to.
Just imagine that you may want to convert to 'o' and 'i'. In this case you can just make a map:
mp = {u'\u00f3':'o', u'\u00ed':'i'}
Than you may apply the replacement like:
t = u'R\u00f3is\u00edn'
for i in range(len(t)):
if t[i] in mp: t[i]=mp[t[i]]
print t
I apologize for posting as a second answer, I don't have the reputation to comment on abarnert's solution.
After using his function to process approximately 50K android strings I noticed that there is yet another small improvement possible for certain use-cases.
I changed the + to {1,4} to deal with the case where valid hex characters follow a 4-digit escape.
I also changed int(escapesequence) to read int(digits)
>>> def unescapematch(matchobj):
... escapesequence = matchobj.group(0)
... digits = escapesequence[2:]
... ordinal = int(digits, 16)
... char = unichr(ordinal)
... return char
>>> print re.sub(r'(\\u[0-9A-Fa-f]{1,4})', unescapematch, "Wi\u2011Fi")
Wi‑Fi
>>> print re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, "Wi\u2011Fi")
Traceback (most recent call last):
File "<pyshell#102>", line 1, in <module>
print re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, "Wi\u2011Fi")
File "C:\Python27\lib\re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "<pyshell#99>", line 5, in unescapematch
char = unichr(ordinal)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
for analysis I'd have to unescape URL-encoded binary strings (non-printable characters most likely). The strings sadly come in the extended URL-encoding form, e.g. "%u616f". I want to store them in a file that then contains the raw binary values, eg. 0x61 0x6f here.
How do I get this into binary data in python? (urllib.unquote only handles the "%HH"-form)
The strings sadly come in the extended URL-encoding form, e.g. "%u616f"
Incidentally that's not anything to do with URL-encoding. It's an arbitrary made-up format produced by the JavaScript escape() function and pretty much nothing else. If you can, the best thing to do would be to change the JavaScript to use the encodeURIComponent function instead. This will give you a proper, standard URL-encoded UTF-8 string.
e.g. "%u616f". I want to store them in a file that then contains the raw binary values, eg. 0x61 0x6f here.
Are you sure 0x61 0x6f (the letters "ao") is the byte stream you want to store? That would imply UTF-16BE encoding; are you treating all your strings that way?
Normally you'd want to turn the input into Unicode then write it out using an appropriate encoding, such as UTF-8 or UTF-16LE. Here's a quick way of doing it, relying on the hack of making Python read '%u1234' as the string-escaped format u'\u1234':
>>> ex= 'hello %e9 %u616f'
>>> ex.replace('%u', r'\u').replace('%', r'\x').decode('unicode-escape')
u'hello \xe9 \u616f'
>>> print _
hello é 慯
>>> _.encode('utf-8')
'hello \xc2\xa0 \xe6\x85\xaf'
I guess you will have to write the decoder function by yourself. Here is an implementation to get you started:
def decode(file):
while True:
c = file.read(1)
if c == "":
# End of file
break
if c != "%":
# Not an escape sequence
yield c
continue
c = file.read(1)
if c != "u":
# One hex-byte
yield chr(int(c + file.read(1), 16))
continue
# Two hex-bytes
yield chr(int(file.read(2), 16))
yield chr(int(file.read(2), 16))
Usage:
input = open("/path/to/input-file", "r")
output = open("/path/to/output-file", "wb")
output.writelines(decode(input))
output.close()
input.close()
Here is a regex-based approach:
# the replace function concatenates the two matches after
# converting them from hex to ascii
repfunc = lambda m: chr(int(m.group(1), 16))+chr(int(m.group(2), 16))
# the last parameter is the text you want to convert
result = re.sub('%u(..)(..)', repfunc, '%u616f')
print result
gives
ao