For analysis I have to unescape URL-encoded binary strings (most likely containing non-printable characters). The strings sadly come in the extended URL-encoding form, e.g. "%u616f". I want to store them in a file that then contains the raw binary values, e.g. 0x61 0x6f here.
How do I get this into binary data in Python? (urllib.unquote only handles the "%HH" form.)
The strings sadly come in the extended URL-encoding form, e.g. "%u616f"
Incidentally that's not anything to do with URL-encoding. It's an arbitrary made-up format produced by the JavaScript escape() function and pretty much nothing else. If you can, the best thing to do would be to change the JavaScript to use the encodeURIComponent function instead. This will give you a proper, standard URL-encoded UTF-8 string.
e.g. "%u616f". I want to store them in a file that then contains the raw binary values, eg. 0x61 0x6f here.
Are you sure 0x61 0x6f (the letters "ao") is the byte stream you want to store? That would imply UTF-16BE encoding; are you treating all your strings that way?
Normally you'd want to turn the input into Unicode then write it out using an appropriate encoding, such as UTF-8 or UTF-16LE. Here's a quick way of doing it, relying on the hack of making Python read '%u1234' as the string-escaped format u'\u1234':
>>> ex= 'hello %e9 %u616f'
>>> ex.replace('%u', r'\u').replace('%', r'\x').decode('unicode-escape')
u'hello \xe9 \u616f'
>>> print _
hello é 慯
>>> _.encode('utf-8')
'hello \xc3\xa9 \xe6\x85\xaf'
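The session above is Python 2. On Python 3 the same trick needs a detour through bytes, since str no longer has a decode method (a sketch of the equivalent steps, assuming the input is ASCII apart from the escapes):

```python
ex = 'hello %e9 %u616f'
# Rewrite the escapes into string-escape form, then let the
# unicode_escape codec interpret them (Python 3: decode from bytes).
decoded = (ex.replace('%u', '\\u').replace('%', '\\x')
             .encode('ascii')
             .decode('unicode_escape'))
print(decoded)                     # hello é 慯
utf8_bytes = decoded.encode('utf-8')
```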
I guess you will have to write the decoder function by yourself. Here is an implementation to get you started:
def decode(file):
    while True:
        c = file.read(1)
        if c == "":
            # End of file
            break
        if c != "%":
            # Not an escape sequence
            yield c
            continue
        c = file.read(1)
        if c != "u":
            # One hex-byte
            yield chr(int(c + file.read(1), 16))
            continue
        # Two hex-bytes
        yield chr(int(file.read(2), 16))
        yield chr(int(file.read(2), 16))
Usage:
input = open("/path/to/input-file", "r")
output = open("/path/to/output-file", "wb")
output.writelines(decode(input))
output.close()
input.close()
Here is a regex-based approach:
import re

# the replace function concatenates the two matches after
# converting them from hex to ascii
repfunc = lambda m: chr(int(m.group(1), 16)) + chr(int(m.group(2), 16))
# the last parameter is the text you want to convert
result = re.sub('%u(..)(..)', repfunc, '%u616f')
print result
gives
ao
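A sketch of a variant that handles both the %HH and %uHHHH forms in one pass (the function name and pattern are mine, not from the answer; on Python 2 the result is a byte str as the question wants, on Python 3 it is a str of code points 0–255):

```python
import re

def unescape(s):
    # %uHHHH -> two characters (high byte, low byte), %HH -> one character
    def repl(m):
        if m.group(1):  # matched the %uHHHH alternative
            return chr(int(m.group(1)[:2], 16)) + chr(int(m.group(1)[2:], 16))
        return chr(int(m.group(2), 16))
    return re.sub(r'%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})', repl, s)

print(unescape('%u616f'))  # ao
```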
Related
I have to deal with rtf files where cyrillic characters are converted to escaped sequences:
{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}
I want to convert the cyrillic symbols but leave the rtf tags unchanged. Is there a pythonic way to do it without third-party apps (like OpenOffice)?
We can first make a list of the hex codes using a regex, then create a bytes object with these values, which we can decode. It appears that your data was encoded using "cp1251".
import re

data = r"pg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
hex_codes = re.findall(r"(?<=')[0-9A-F]{2}", data)
encoded = bytes(int(hcode, 16) for hcode in hex_codes)
# or, as rightly suggested by @Henry Tjhia:
# encoded = bytes.fromhex(''.join(hex_codes))
text = encoded.decode('cp1251')
print(text)
# Список документов
Although @Thierry Lathuille's answer didn't solve the initial problem (I need the rtf tags unchanged), it solved the most difficult part. So the solution to the initial problem:
import re

string = r"{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
hex_codes = re.findall("(?<=')[0-9A-F]{2}", string)
d = {"\\'" + code: bytes.fromhex(code).decode("cp1251") for code in hex_codes}
for byte, char in d.items():
    string = string.replace(byte, char)
print(string)
# {\rtf1\fbidis\ansicpg1251{\info{\title Список документов}
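The dictionary-and-replace loop can also be collapsed into a single re.sub call that decodes each \'HH escape in place (a sketch, assuming cp1251 as above; note this per-byte decode only works for single-byte encodings like cp1251):

```python
import re

string = r"{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
result = re.sub(
    r"\\'([0-9A-Fa-f]{2})",                          # matches \'HH
    lambda m: bytes.fromhex(m.group(1)).decode("cp1251"),
    string,
)
print(result)
# {\rtf1\fbidis\ansicpg1251{\info{\title Список документов}
```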
I am attempting to extract numerical values from a byte string transmitted from an RS-232 port. Here is an example:
b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
If I attempt to decode the byte string as 'utf-8', I receive the following output:
x = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
x.decode('utf-8', errors='ignore')
>>> 'SS3.6\n'
What I ideally want is 23.67, which is observed after every \xb pattern. How could I extract 23.67 from this byte string?
As mentioned in https://stackoverflow.com/a/59416410/3319460, your input doesn't really represent the output you seek. But just to fulfil the requirements, we might impose these semantics onto the input:
digits and the '.' sign are kept; other ASCII characters are skipped
if a byte is non-ASCII, check whether its high nibble is 0xB; if so, keep just the ASCII part of the byte (b & 0b01111111)
That is quite easily done in Python.
def _filter(char):
    # keep 0xBx bytes, '.', and the ASCII digits 0-9 (48..57)
    return (char & 0xF0) == 0xB0 or chr(char) == "." or 48 <= char <= 57

def filter_xbchars(value: bytes) -> str:
    return "".join(chr(ch & 0b01111111) for ch in value if _filter(ch))

import pytest

@pytest.mark.parametrize(
    "value, expected",
    [(b"S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n", "23.67")],
)
def test_simple(value, expected):
    assert filter_xbchars(value) == expected
Please be aware that even though the code above satisfies the requirements, it answers a poorly described task and is therefore a rather nonsensical solution. The code solves the task as you asked for it, but we should first reconsider whether the task even makes sense. I advise you to check the data you will test against and the meaning of the data (protocol).
Good luck :)
If you just want to get 23.67 from that byte string try this:
a = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
b = repr(a)[2:-1]
c = b.split("\\")
d = ''
e = []
for i in c:
    if "xb" in i:
        e.append(i[2:])
d = "".join(e)
print(d)
Please notice that \xHH is an escape code representing the hexadecimal value HH, and as such your string '\xb23.6\xb7' does not contain "23.67" but rather "(0xB2)3.6(0xB7)"; those values cannot be extracted using a regular expression because they are not present in the string in the first place.
'\xb23.6\xb7' is not a valid UTF-8 sequence, while in Latin-1 extended ASCII it would represent "²3.6·". The presence of many 0xA0 values also suggests a Latin-1 encoding, as 0xA0 represents a non-breaking space in that encoding (a fairly common character), while in UTF-8 it does not start a meaningful sequence.
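To see that concretely, decoding the same byte string as Latin-1 (a guess about the device's encoding, not something the question confirms) keeps every byte:

```python
x = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
text = x.decode('latin-1')  # Latin-1 maps each byte 1:1 to U+0000..U+00FF
print(repr(text))           # the 0xB2/0xB7 bytes show up as '²' and '·'
```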
I'm trying to get bytes from a png file in python 3, and print a string showing the bytes from the png file. However, it gives me this output:
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00(\x00\x00\x00(\x08\x02\x00\x00\x00\x03\x9c/:\x00\x00\x00\x01sRGB\x00\xae\xce\x1c\xe9\x00\x00\x00\x04gAMA\x00\x00\xb1\x8f\x0b\xfca\x05\x00\x00\x00\tpHYs\x00\x00\x0e\xc3\x00\x00\x0e\xc3\x01\xc7o\xa8d\x00\x00\x01XIDATXG\xe5\xcd\xb1m\x031\x14\x04\xd1\xebF\xad\xb8+\xd5\xe0\x8a\xe5`f\x19|,.\xa0\x0fL\xf4\xc0h\x08.\xafo\xf5>\xc8/a;\xc2/a;\xc2/a;\xc2/a;\xc2/a\x0b\xebC\x1c\r+la}\x88\xa3a\x85-\x88\xbf?\xff=p4\xac\xb0\x05q\xacl\x1c8\x1aV\xd8\x828V6\x0e\x1c\r+lA\x1c+\x1b\x07\x8e\x86\x15\xb6 \x8e\x95\x8d\x03G\xc3\n[\x10\xeb\xca\xbd\xfa\xc4\xd1\xb0\xc2\x16\xc4\xbar\xaf>q4\xac\xb0\x05\xb1\xae|\xde\xafz\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xf7\xea\x13G\xc3\n[\x10\xeb\xca\xbd\xfa\xc4\xd1\xb0\xc2\x16\xc4\xb1\xb2q\xe0hXa\x0b\xe2X\xd98p4\xac\xb0\x05q\xacl\x1c8\x1aV\xd8\x828V6\x0e\x1c\r+lA\x1c+\x1b\x07\x8e\x86\x15\xb6\xb0>\xc4\xd1\xb0\xc2\x16\xd6\x878\x1aV\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0\xcb/s]\x7f\xf8o$|7\xc4\xdf\xeb\x00\x00\x00\x00IEND\xaeB`\x82'
instead of normal bytes (here are the bytes it should show): 89504E470D0A1A0A0000000D4948445200000028000000280802000000039C2F3A000000017352474200AECE1CE90000000467414D410000B18F0BFC6105000000097048597300000EC300000EC301C76FA86400000158494441545847E5CDB16D03311404D1EB46ADB82BD5E08AE56066197C2C2EA00F4CF4C068082EAF6FF53EC82F613BC22F613BC22F613BC22F613BC22F610BEB431C0D2B6C617D88A361852D88BF3FFF3D7034ACB00571AC6C1C381A56D8823856360E1C0D2B6C411C2B1B078E8615B6208E958D0347C30A5B10EBCABDFAC4D1B0C216C4BA72AF3E7134ACB005B1AE7CDEAF7AB8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BE3BF75B8AD4F1C0D2B6C41AC2BF7EA1347C30A5B10EBCABDFAC4D1B0C216C4B1B271E06858610BE258D9387034ACB00571AC6C1C381A56D8823856360E1C0D2B6C411C2B1B078E8615B6B03EC4D1B0C216D687381A56D88EF04BD88EF04BD88EF04BD88EF04BD88EF0CB2F735D7FF86F247C37C4DFEB0000000049454E44AE426082
Here is the code that I wrote to do this:
fileread = input("Input File: ")
with open(fileread, 'rb') as readfile:
    string = str(readfile.read())
readfile.close()
print("String: " + string)
newstr = str(bytes(string, 'utf-8').decode('utf-8'))
Can anyone help me?
You've got it right. It's just showing the ASCII representation of the data, as that's usually the more useful form:
>>> s = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00(\x00\x00\x00(\x08\x02\x00\x00\x00\x03\x9c/:\x00\x00\x00\x01sRGB\x00\xae\xce\x1c\xe9\x00\x00\x00\x04gAMA\x00\x00\xb1\x8f\x0b\xfca\x05\x00\x00\x00\tpHYs\x00\x00\x0e\xc3\x00\x00\x0e\xc3\x01\xc7o\xa8d\x00\x00\x01XIDATXG\xe5\xcd\xb1m\x031\x14\x04\xd1\xebF\xad\xb8+\xd5\xe0\x8a\xe5`f\x19|,.\xa0\x0fL\xf4\xc0h\x08.\xafo\xf5>\xc8/a;\xc2/a;\xc2/a;\xc2/a;\xc2/a\x0b\xebC\x1c\r+la}\x88\xa3a\x85-\x88\xbf?\xff=p4\xac\xb0\x05q\xacl\x1c8\x1aV\xd8\x828V6\x0e\x1c\r+lA\x1c+\x1b\x07\x8e\x86\x15\xb6 \x8e\x95\x8d\x03G\xc3\n[\x10\xeb\xca\xbd\xfa\xc4\xd1\xb0\xc2\x16\xc4\xbar\xaf>q4\xac\xb0\x05\xb1\xae|\xde\xafz\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xe3\xbfu\xb8\xadO\x1c\r+lA\xac+\xf7\xea\x13G\xc3\n[\x10\xeb\xca\xbd\xfa\xc4\xd1\xb0\xc2\x16\xc4\xb1\xb2q\xe0hXa\x0b\xe2X\xd98p4\xac\xb0\x05q\xacl\x1c8\x1aV\xd8\x828V6\x0e\x1c\r+lA\x1c+\x1b\x07\x8e\x86\x15\xb6\xb0>\xc4\xd1\xb0\xc2\x16\xd6\x878\x1aV\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0K\xd8\x8e\xf0\xcb/s]\x7f\xf8o$|7\xc4\xdf\xeb\x00\x00\x00\x00IEND\xaeB`\x82'
>>> s[0]
137
>>> s[1]
80
>>> s[2]
78
>>> hex(s[0])
'0x89'
>>> hex(s[1])
'0x50'
>>> hex(s[2])
'0x4e'
>>>
I don't think you'd need the UTF-8 decode step as this is just binary data right?
If you actually want an ASCII representation of the data in hex form to match what you have in the question you could use
>>> ''.join('%02x' % c for c in s)
'89504e470d0a1a0a0000000d4948445200000028000000280802000000039c2f3a000000017352474200aece1ce90000000467414d410000b18f0bfc6105000000097048597300000ec300000ec301c76fa86400000158494441545847e5cdb16d03311404d1eb46adb82bd5e08ae56066197c2c2ea00f4cf4c068082eaf6ff53ec82f613bc22f613bc22f613bc22f613bc22f610beb431c0d2b6c617d88a361852d88bf3fff3d7034acb00571ac6c1c381a56d8823856360e1c0d2b6c411c2b1b078e8615b6208e958d0347c30a5b10ebcabdfac4d1b0c216c4ba72af3e7134acb005b1ae7cdeaf7ab8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2be3bf75b8ad4f1c0d2b6c41ac2bf7ea1347c30a5b10ebcabdfac4d1b0c216c4b1b271e06858610be258d9387034acb00571ac6c1c381a56d8823856360e1c0d2b6c411c2b1b078e8615b6b03ec4d1b0c216d687381a56d88ef04bd88ef04bd88ef04bd88ef04bd88ef0cb2f735d7ff86f247c37c4dfeb0000000049454e44ae426082'
You're getting the bytes fine; you just want to print them differently from the default Python method (which uses characters for printable ASCII codes so you can read them more easily). Just iterate over the bytes and format them however you like:
for byte in string:
    print(("%02x" % byte).upper(), end="")
If the file isn't too large, you could also do it with one print() call by doing the formatting all at once and printing that:
print("".join(("%02x" % byte).upper() for byte in string))
This will build a string using approximately 6 times the amount of memory as your file before printing it. Use the first method if this could be a problem.
Actually, I just remembered... Python has a module for this!
from binascii import hexlify
print(hexlify(string).upper())
This will actually use even more memory, since it converts the letters in the hex string to uppercase after building it, but if you're OK with lowercase letters in your hex, this is probably the best solution.
BTW, it's advisable not to call what you read from your file string; it's binary data, not text.
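As an aside (Python 3 only, not part of the original answer), bytes objects also have a built-in hex() method, which makes binascii optional:

```python
data = b'\x89PNG\r\n'        # first bytes of any PNG file
print(data.hex().upper())    # 89504E470D0A
```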
I have encounter a case where I need to convert a string of character into a character string in python.
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
print s #gives: \x80\x78\x07\x00\x75\xb3
What I want is that, given the string s, I can get the real characters stored in s, which in this case are \x80, \x78, \x07, \x00, \x75, and \xb3 (mostly non-printable bytes).
You can use string-escape encoding (Python 2.x):
>>> s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
>>> s.decode('string-escape')
'\x80x\x07\x00u\xb3'
Use unicode-escape encoding (in Python 3.x, need to convert to bytes first):
>>> s.encode().decode('unicode-escape')
'\x80x\x07\x00u³'
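If, on Python 3, you need raw bytes rather than a str, you can round-trip through Latin-1, which maps code points 0–255 straight back to single bytes (a sketch):

```python
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
# unicode_escape turns \xHH into code points 0-255; latin-1 maps them to bytes
raw = s.encode('ascii').decode('unicode_escape').encode('latin-1')
print(raw)  # b'\x80x\x07\x00u\xb3'
```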
you can simply write a function taking the string and returning the converted form, something like this:
def str_to_chr(s):
    res = ""
    for part in s.split("\\")[1:]:       # "\\x33\\x45" -> ["x33", "x45"]
        res += chr(int("0" + part, 16))  # hex string -> int -> character
    return res
remember to print the return value of the function.
to find out what each line does, run that line; if you still have questions, comment it... i'll answer
or you can build a string from the byte values, but that might not all be "printable" depending on your encoding, example:
# -*- coding: utf-8 -*-
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
r = ''
for byte in s.split('\\x'):
    if byte:  # to get rid of empties
        r += chr(int(byte, 16))  # convert to int from hex string first
print(r)  # given the example, not all bytes are printable char's in utf-8
HTH, Edwin
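Since every escape here is exactly \xHH, another option (Python 3; a sketch that assumes that fixed format) is to strip the \x markers and hand the remaining hex digits to bytes.fromhex:

```python
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
raw = bytes.fromhex(s.replace('\\x', ''))  # '807807007 ...' -> raw bytes
print(raw)  # b'\x80x\x07\x00u\xb3'
```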
I'm attempting to write a python implementation of java.util.Properties which has a requirement that unicode characters are written to the output file in the format of \u####
(Documentation is here if you are curious, though it isn't important to the question: http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html)
I basically need something that passes the following test case
def my_encode(s):
    # Magic

def my_decode(s):
    # Magic

# Easy ones that are solved by .encode/.decode 'unicode_escape'
assert my_decode('\u2603') == u'☃'
assert my_encode(u'☃') == '\\u2603'
# This one also works with .decode('unicode_escape')
assert my_decode('\\u0081') == u'\x81'
# But this one does not quite produce what I want
assert my_encode(u'\u0081') == '\\u0081'  # Instead produces '\\x81'
Note that I've tried unicode_escape and it comes close but doesn't quite satisfy what I want
I've noticed that simplejson does this conversion correctly:
>>> simplejson.dumps(u'\u0081')
'"\\u0081"'
But I'd rather avoid:
reinventing the wheel
doing some gross substringing of simplejson's output
According to the documentation you linked to:
Characters less than \u0020 and characters greater than \u007E in property keys or values are written as \uxxxx for the appropriate hexadecimal value xxxx.
So, that converts into Python readily as:
def my_encode(s):
    return ''.join(
        c if 0x20 <= ord(c) <= 0x7E else r'\u%04x' % ord(c)
        for c in s
    )
For each character in the string, if the code point is between 0x20 and 0x7E, then that character remains unchanged; otherwise, \u followed by the code point encoded as a 4-digit hex number is used. The expression c for c in s is a generator expression, so we convert that back into a string using str.join on the empty string.
For decoding, you can just use the unicode_escape codec as you mentioned:
def my_decode(s):
    return s.decode('unicode_escape')
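A quick round-trip check of the two functions (written for Python 3, where str has no decode method, so my_decode goes through bytes; the definitions are repeated so the snippet runs standalone):

```python
def my_encode(s):
    return ''.join(
        c if 0x20 <= ord(c) <= 0x7E else r'\u%04x' % ord(c)
        for c in s
    )

def my_decode(s):
    # Python 3: encode to bytes first so the unicode_escape codec applies
    return s.encode('ascii').decode('unicode_escape')

assert my_encode('☃') == r'\u2603'
assert my_decode(r'\u2603') == '☃'
assert my_encode('\u0081') == r'\u0081'  # the case plain unicode_escape got wrong
```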