Byte representation of unicode string

This is python3 code:
>>> bytes(json.dumps({'Ä':0}), "utf-8")
b'{"\\u00c4": 0}'
json.dumps() returns a Unicode string, and bytes() returns its byte representation: the string encoded as UTF-8.
How do I achieve the same result in Lua? I need a byte representation of a JSON object that contains non-ASCII characters.
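To make the target output concrete, here is a minimal sketch of where the escaping in the Python result comes from. By default json.dumps escapes non-ASCII characters (ensure_ascii=True), so the encoded bytes are plain ASCII. The variable names are illustrative only.

```python
import json

obj = {'Ä': 0}

# Default behaviour: non-ASCII characters are escaped, so the text is pure ASCII.
escaped = json.dumps(obj)
escaped_bytes = escaped.encode("utf-8")

# With ensure_ascii=False the character is kept, and UTF-8 encodes it as two bytes.
raw = json.dumps(obj, ensure_ascii=False)
raw_bytes = raw.encode("utf-8")

print(escaped_bytes)  # b'{"\\u00c4": 0}'
print(raw_bytes)      # b'{"\xc3\x84": 0}'
```

The doubled backslash in the first output is only how Python displays a single backslash inside a bytes literal; the data itself contains one backslash.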

You have to do it manually.
local function utf8_to_unicode(utf8str, pos)
    local code, size = utf8str:byte(pos), 1
    if code >= 0xC0 and code < 0xFE then
        local mask = 64
        code = code - 128
        repeat
            local next_byte = utf8str:byte(pos + size) or 0
            if next_byte >= 0x80 and next_byte < 0xC0 then
                code, size = (code - mask - 2) * 64 + next_byte, size + 1
            else
                code, size = utf8str:byte(pos), 1
            end
            mask = mask * 32
        until code < mask
    end
    -- returns code, number of bytes in this utf8 char
    return code, size
end
function utf8_to_python(utf8str)
    local pos = 1
    local z = ''
    while pos <= #utf8str do
        local unicode, size = utf8_to_unicode(utf8str, pos)
        pos = pos + size
        if unicode < 0x80 then
            z = z..string.char(unicode)
        elseif unicode < 0x10000 then
            z = z..string.format('\\\\u%04x', unicode)
        else
            z = z..string.format('\\\\U%08x', unicode)
        end
    end
    return z
end
Usage:
local json = require('json')
local x = {['Ä'] = 0}
local y = json.encode(x)
print(y) --> {"Ä":0}
local z = utf8_to_python(y)
print(z) --> {"\\u00c4":0}

A simpler version using string.gsub:
local function python_escape(str)
    return (string.gsub(
        str,
        -- leading byte followed by one or more continuation bytes;
        -- decimal version for Lua 5.1: "[\194-\244][\128-\191]+",
        "[\xC2-\xF4][\x80-\xBF]+",
        function (non_ASCII)
            local codepoint = utf8.codepoint(non_ASCII)
            if codepoint <= 0xFFFF then
                return ("\\u%04x"):format(codepoint)
            else
                return ("\\U%08x"):format(codepoint)
            end
        end))
end
I put parentheses around the return value (string.gsub(--[[...]])) to strip away the second return value of string.gsub (the number of replacements).
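For comparison, the same substitution rule can be sketched in Python. This is not the Lua answer itself, only an illustration of the rule it applies: every non-ASCII character becomes a \uXXXX escape, or \UXXXXXXXX above the Basic Multilingual Plane. The function name mirrors the Lua one and is otherwise arbitrary.

```python
import re

def python_escape(text):
    """Replace every non-ASCII character with a \\uXXXX or \\UXXXXXXXX escape."""
    def escape(match):
        codepoint = ord(match.group())
        if codepoint <= 0xFFFF:
            return "\\u%04x" % codepoint
        return "\\U%08x" % codepoint
    # A function replacement is inserted literally, so no extra escaping is needed.
    return re.sub(r"[^\x00-\x7f]", escape, text)

print(python_escape('{"Ä":0}'))  # {"\u00c4":0}
```

Because Python strings are sequences of code points, no manual UTF-8 decoding is needed here, unlike in the Lua versions.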

Related

Converting an integer to a string of escaped hex (NOT bytes)

I am trying to come up with a way to convert an integer to a string of escaped hex, as if it was coded that way... for example:
x = '\xab\xcd\xff\xff'
I have tried the following function, but it appears that chr() only does this for characters with the integer value 160 and below:
def int2chr(val, size):
    out = ''
    for _ in range(size):
        tmp = val & 0xff
        val >>= 8
        out = f'{chr(tmp)}{out}'
    return out
From this, I get:
>>> int2chr(0xabcdffff, 4)
'«Íÿÿ'
>>>
I would like it to return '\xab\xcd\xff\xff' instead...
This doesn't work any better:
def int2chr(val, size):
    out = ''
    for _ in range(size):
        tmp = val & 0xff
        val >>= 8
        out = f'\\x{hex(tmp)[2:]}{out}'
    return out
>>> int2chr(0xabcdffff,4)
'\\xab\\xcd\\xff\\xff'
>>>
The closest I've been able to come gives me bytes, not a string:
b'\xab\xcd\xff\xff'
I am very surprised that there isn't an easy way to do this. Can anyone shed some light?
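The following sketch may help separate the two things the question mixes up; it is an illustration under stated assumptions, not an answer taken from the thread. A literal such as '\xab\xcd\xff\xff' is a four-character string, whereas text that shows backslashes contains sixteen characters, and the interactive prompt doubles those backslashes when it displays the string. The helper names int_to_chars and int_to_escaped_text are hypothetical.

```python
def int_to_chars(val: int, size: int) -> str:
    """Return `size` characters whose code points are the big-endian bytes of val."""
    return val.to_bytes(size, "big").decode("latin-1")

def int_to_escaped_text(val: int, size: int) -> str:
    """Return the literal text \\xNN for each big-endian byte of val."""
    return "".join(f"\\x{byte:02x}" for byte in val.to_bytes(size, "big"))

chars = int_to_chars(0xabcdffff, 4)
text = int_to_escaped_text(0xabcdffff, 4)

print(chars == '\xab\xcd\xff\xff')  # True
print(len(chars), len(text))        # 4 16
print(text)                         # \xab\xcd\xff\xff
```

Under this reading, the first int2chr attempt already returns a string equal to '\xab\xcd\xff\xff'; only its display at the prompt differs from the source literal.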

Python code for hashing byte values with md5

Please accept this question from an inexperienced (and very enthusiastic) programmer who tries to learn:
I need to calculate the md5 hashes of every byte combination from 0x00 to 0xff. I tried to do this with Python, but I'm not sure how Python interprets my input. As said, I need to hash the byte values, not the character 'fa' or '00', but the values themselves.
Here is an example of code that I have tested. The first problem is that the output of bytes.fromhex shows some of the hex numbers as ASCII characters, so I suppose that the ASCII representation is hashed, not the byte value. The second problem is that I'm unsure how to use hashlib correctly so that the byte value is hashed.
import hashlib
# Global variables
HEX_VALUES = {0:"0",1:"1",2:"2",3:"3",4:"4",5:"5",6:"6",7:"7",8:"8",9:"9",10:"a",11:"b",12:"c",13:"d",14:"e",15:"f"}
# Helper function for converting decimal number to another base.
def dec_to_base(num,base):
    exp = 0
    list1 = []
    while (num // base ** exp) > 0:
        num2 = (num // base ** exp) % base
        list1.insert(0,num2)
        exp += 1
    return list1
# Function for converting decimal to hex numbers.
def dec_to_hex(num):
    ret_val = []
    for x in dec_to_base(num,16):
        x = HEX_VALUES[x]
        ret_val.append(x)
    ret_val_str = ''.join(ret_val)
    ret_val_str_pad = ret_val_str.zfill(4)
    # Returns the hex number as a string with four zero-padding.
    return ret_val_str_pad
for i in range(1,65536):
    hex_number = bytes.fromhex(dec_to_hex(i))
    print(hex_number)
    h = hashlib.md5(hex_number)
    md5_hash = h.hexdigest()
    print(md5_hash)
    # Checks after TARGET STRING

md5 accepts bytes, not str.
from hashlib import md5
md5(b'00').hexdigest() # 'b4b147bc522828731f1a016bfa72c073'
md5('00').hexdigest() # TypeError: Unicode-objects must be encoded before hashing
A string must be encoded to bytes before it is passed to md5.
md5('00'.encode()).hexdigest() # 'b4b147bc522828731f1a016bfa72c073'
'00'.encode() == b'00' # True
b'\x30\x30' == b'00' # True
In the case above, in C terms, you are passing the byte array {0x30, 0x30} as the argument. In your code, hex_number = bytes.fromhex(dec_to_hex(i)) returns 2 bytes, so each hash covers a two-byte value rather than a single byte. Depending on your goal, that might not be what you want.
hex_number = bytes.fromhex(dec_to_hex(1)) # b'\x00\x01'
md5(b'\x00\x01').hexdigest() # '441077cc9e57554dd476bdfb8b8b8102'
md5(b'\x01').hexdigest() # '55a54008ad1ba589aa210d2629c1df41'
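If the goal is one hash per single byte value from 0x00 to 0xff, a minimal sketch could build the byte directly with bytes([value]), without any manual hex conversion. The name digests is illustrative.

```python
import hashlib

# Map each byte value 0x00..0xff to the MD5 hex digest of that single byte.
digests = {value: hashlib.md5(bytes([value])).hexdigest() for value in range(256)}

print(digests[0x01])  # 55a54008ad1ba589aa210d2629c1df41
```

For two-byte inputs such as those the question's loop produces, value.to_bytes(2, 'big') would give the same bytes as bytes.fromhex on a four-digit hex string.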

How to keep leading zeros in binary integer (python)?

I need to calculate a checksum for a hex serial word string using XOR. To my (limited) knowledge this has to be performed using the bitwise operator ^. Also, the data has to be converted to binary integer form. Below is my rudimentary code - but the checksum it calculates is 1000831. It should be 01001110 or 47hex. I think the error may be due to missing the leading zeros. All the formatting I've tried to add the leading zeros turns the binary integers back into strings. I appreciate any suggestions.
word = ('010900004f')
#divide word into 5 separate bytes
wd1 = word[0:2]
wd2 = word[2:4]
wd3 = word[4:6]
wd4 = word[6:8]
wd5 = word[8:10]
#this converts a hex string to a binary string
wd1bs = bin(int(wd1, 16))[2:]
wd2bs = bin(int(wd2, 16))[2:]
wd3bs = bin(int(wd3, 16))[2:]
wd4bs = bin(int(wd4, 16))[2:]
wd5bs = bin(int(wd5, 16))[2:]
#this converts binary string to binary integer
wd1i = int(wd1bs)
wd2i = int(wd2bs)
wd3i = int(wd3bs)
wd4i = int(wd4bs)
wd5i = int(wd5bs)
#now that I have binary integers, I can use the XOR bitwise operator to cal cksum
checksum = (wd1i ^ wd2i ^ wd3i ^ wd4i ^ wd5i)
#I should get 47 hex as the checksum
print (checksum, type(checksum))

Why use all these conversions and the costly string functions?
(I will answer the X part of your XY-Problem, not the Y part.)
def checksum (s):
    v = int (s, 16)
    checksum = 0
    while v:
        checksum ^= v & 0xff
        v >>= 8
    return checksum
cs = checksum ('010900004f')
print (cs, bin (cs), hex (cs) )
Result is 0x47 as expected. Btw 0x47 is 0b1000111 and not as stated 0b1001110.
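The same XOR checksum can also be sketched in Python 3 by letting bytes.fromhex split the word into bytes. The function name xor_checksum is an illustrative choice, not part of the answer above.

```python
from functools import reduce
from operator import xor

def xor_checksum(hex_word: str) -> int:
    """XOR together all bytes encoded in a hex string."""
    return reduce(xor, bytes.fromhex(hex_word), 0)

cs = xor_checksum('010900004f')
print(hex(cs), format(cs, '08b'))  # 0x47 01000111
```

Unlike the shift loop, this version also counts leading zero bytes, which makes no difference for XOR but keeps the byte boundaries explicit.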

s = '010900004f'
b = int(s, 16)
print reduce(lambda x, y: x ^ y, ((b>> 8*i)&0xff for i in range(0, len(s)/2)), 0)

Just modify like this.
before:
wd1i = int(wd1bs)
wd2i = int(wd2bs)
wd3i = int(wd3bs)
wd4i = int(wd4bs)
wd5i = int(wd5bs)
after:
wd1i = int(wd1bs, 2)
wd2i = int(wd2bs, 2)
wd3i = int(wd3bs, 2)
wd4i = int(wd4bs, 2)
wd5i = int(wd5bs, 2)
Why doesn't your code work?
Because you are misunderstanding how int(wd1bs) behaves.
See the documentation of int. By default, Python's int function treats its string argument as base 10.
But you expect int to treat its argument as base 2.
So you need to write int(wd1bs, 2).
Alternatively, you can rewrite your entire code like this, so that you don't need the bin function at all. This code is basically the same as Hyperboreus's answer. :)
w = int('010900004f', 16)
w1 = (0xff00000000 & w) >> 4*8
w2 = (0x00ff000000 & w) >> 3*8
w3 = (0x0000ff0000 & w) >> 2*8
w4 = (0x000000ff00 & w) >> 1*8
w5 = (0x00000000ff & w)
checksum = w1 ^ w2 ^ w3 ^ w4 ^ w5
print hex(checksum)
#'0x47'
And this is an even shorter one.
import binascii
word = '010900004f'
print hex(reduce(lambda a, b: a ^ b, (ord(i) for i in binascii.unhexlify(word))))
#0x47
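The effect of the base argument discussed above can be checked with a small sketch. It also shows that leading zeros belong to the text form only: an integer has none, but format() can restore them for display.

```python
bits = bin(int('4f', 16))[2:]     # '1001111'

as_decimal = int(bits)            # digits read as base 10
as_binary = int(bits, 2)          # digits read as base 2

print(as_decimal)                 # 1001111
print(as_binary, hex(as_binary))  # 79 0x4f

# Zero-pad to 8 binary digits when printing.
print(format(as_binary, '08b'))   # 01001111
```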

Write boolean string to binary file?

I have a string of booleans and I want to create a binary file using these booleans as bits. This is what I am doing:
# first append the string with 0s to make its length a multiple of 8
while len(boolString) % 8 != 0:
    boolString += '0'
# write the string to the file byte by byte
i = 0
while i < len(boolString) / 8:
    byte = int(boolString[i*8 : (i+1)*8], 2)
    outputFile.write('%c' % byte)
    i += 1
But this generates the output 1 byte at a time and is slow. What would be a more efficient way to do it?

It should be quicker if you calculate all your bytes first and then write them all together. For example
b = bytearray([int(boolString[x:x+8], 2) for x in range(0, len(boolString), 8)])
outputFile.write(b)
I'm also using a bytearray which is a natural container to use, and can also be written directly to your file.
You can of course use libraries if that's appropriate such as bitarray and bitstring. Using the latter you could just say
bitstring.Bits(bin=boolString).tofile(outputFile)
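In Python 3 the whole conversion can be sketched with int.to_bytes, which keeps leading zero bytes because the length is given explicitly. The helper name bits_to_bytes is hypothetical, and io.BytesIO stands in for a real file opened in binary mode.

```python
import io

def bits_to_bytes(bit_string: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, zero-padding the tail."""
    padded = bit_string + '0' * (-len(bit_string) % 8)
    if not padded:
        return b''
    return int(padded, 2).to_bytes(len(padded) // 8, 'big')

# io.BytesIO stands in for a file opened with open(path, 'wb').
output_file = io.BytesIO()
output_file.write(bits_to_bytes('0100000101'))
print(output_file.getvalue())  # b'A@'
```

All bytes are produced first and written with a single write call, which is the point the answer above makes.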

Here's another answer, this time using an industrial-strength utility function from PyCrypto (The Python Cryptography Toolkit). In version 2.6 (the latest stable release at the time of writing), it is defined in pycrypto-2.6/lib/Crypto/Util/number.py.
The comments preceding it say:
Improved conversion functions contributed by Barry Warsaw, after careful benchmarking
import struct
def long_to_bytes(n, blocksize=0):
    """long_to_bytes(n:long, blocksize:int) : string
    Convert a long integer to a byte string.
    If optional blocksize is given and greater than zero, pad the front of the
    byte string with binary zeros so that the length is a multiple of
    blocksize.
    """
    # after much testing, this algorithm was deemed to be the fastest
    s = b('')
    n = long(n)
    pack = struct.pack
    while n > 0:
        s = pack('>I', n & 0xffffffffL) + s
        n = n >> 32
    # strip off leading zeros
    for i in range(len(s)):
        if s[i] != b('\000')[0]:
            break
    else:
        # only happens when n == 0
        s = b('\000')
        i = 0
    s = s[i:]
    # add back some pad bytes. this could be done more efficiently w.r.t. the
    # de-padding being done above, but sigh...
    if blocksize > 0 and len(s) % blocksize:
        s = (blocksize - len(s) % blocksize) * b('\000') + s
    return s

You can convert a boolean string to a long using data = long(boolString, 2). Then, to write this long to disk, you can use:
while data > 0:
    # a byte holds 256 values; this emits the least significant byte first
    data, byte = divmod(data, 256)
    file.write('%c' % byte)
However, there is no need to build a boolean string at all. It is much easier to use a long, which can hold an arbitrary number of bits. Using bit manipulation you can set or clear the bits as needed, and then write the long to disk as a whole in a single write operation.
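A minimal Python 3 sketch of that idea follows, where int replaces long. The bit positions are arbitrary examples, and io.BytesIO stands in for a real file.

```python
import io

bits = 0
for position in (0, 3, 9):   # set bits 0, 3 and 9
    bits |= 1 << position
bits &= ~(1 << 3)            # clear bit 3 again

# Number of bytes needed to hold the value (at least one).
length = (bits.bit_length() + 7) // 8 or 1
payload = bits.to_bytes(length, 'big')

out = io.BytesIO()           # stands in for open(path, 'wb')
out.write(payload)           # one write for the whole value
print(payload)               # b'\x02\x01'
```

Note that leading zero bits are not recorded this way, so the intended total length has to be tracked separately if it matters.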

You can try this code using the array class:
import array
buffer = array.array('B')
i = 0
while i < len(boolString) / 8:
    byte = int(boolString[i*8 : (i+1)*8], 2)
    buffer.append(byte)
    i += 1
f = file(filename, 'wb')
buffer.tofile(f)
f.close()

A helper class (shown below) makes it easy:
class BitWriter:
    def __init__(self, f):
        self.acc = 0
        self.bcount = 0
        self.out = f
    def __del__(self):
        self.flush()
    def writebit(self, bit):
        if self.bcount == 8 :
            self.flush()
        if bit > 0:
            self.acc |= (1 << (7-self.bcount))
        self.bcount += 1
    def writebits(self, bits, n):
        while n > 0:
            self.writebit( bits & (1 << (n-1)) )
            n -= 1
    def flush(self):
        self.out.write(chr(self.acc))
        self.acc = 0
        self.bcount = 0
with open('outputFile', 'wb') as f:
    bw = BitWriter(f)
    bw.writebits(int(boolString,2), len(boolString))
    bw.flush()

Use the struct package.
This can be used in handling binary data stored in files or from network connections, among other sources.
Edit:
An example using ? as the format character for a bool.
import struct
p = struct.pack('????', True, False, True, False)
assert p == '\x01\x00\x01\x00'
with open("out", "wb") as o:
    o.write(p)
Let's take a look at the file:
$ ls -l out
-rw-r--r-- 1 lutz lutz 4 Okt 1 13:26 out
$ od out
0000000 000001 000001
0000004
Read it in again:
with open("out", "rb") as i:
    q = struct.unpack('????', i.read())
assert q == (True, False, True, False)
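One point worth checking before using this approach: the ? format stores each boolean in a whole byte, not in a single bit, as the 4-byte file above shows. The following Python 3 sketch contrasts that with packing the same flags as bits. The bit-packing part is an illustration added here, not part of the struct answer.

```python
import struct

flags = [True, False, True, False]

# struct's '?' format stores each bool in its own byte.
one_byte_each = struct.pack('4?', *flags)

# Packing the same flags as bits, most significant bit first, needs one byte.
value = 0
for flag in flags:
    value = (value << 1) | int(flag)
value <<= (-len(flags)) % 8   # left-align within the final byte
bit_packed = value.to_bytes((len(flags) + 7) // 8, 'big')

print(one_byte_each)  # b'\x01\x00\x01\x00'
print(bit_packed)     # b'\xa0'
```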

Using Python How can I read the bits in a byte?

I have a file where the first byte contains encoded information. In Matlab I can read the byte bit by bit with var = fread(file, 8, 'ubit1'), and then retrieve each bit by var(1), var(2), etc.
Is there any equivalent bit reader in python?

Read the bits from a file, low bits first.
def bits(f):
    bytes = (ord(b) for b in f.read())
    for b in bytes:
        for i in xrange(8):
            yield (b >> i) & 1
for b in bits(open('binary-file.bin', 'rb')):
    print b
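The generator above is Python 2 code. A Python 3 sketch of the same idea is shorter, because iterating over a bytes object already yields integers. Here io.BytesIO stands in for a file opened with mode 'rb'.

```python
import io

def bits(stream):
    """Yield the bits of a binary stream, least significant bit of each byte first."""
    for byte in stream.read():   # iterating bytes yields ints in Python 3
        for i in range(8):
            yield (byte >> i) & 1

# io.BytesIO stands in for open('binary-file.bin', 'rb').
print(list(bits(io.BytesIO(b'\x05'))))  # [1, 0, 1, 0, 0, 0, 0, 0]
```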

The smallest unit you'll be able to work with is a byte. To work at the bit level you need to use bitwise operators.
x = 3
#Check if the 1st bit is set:
x&1 != 0
#Returns True
#Check if the 2nd bit is set:
x&2 != 0
#Returns True
#Check if the 3rd bit is set:
x&4 != 0
#Returns False

With numpy it is easy like this:
Bytes = numpy.fromfile(filename, dtype = "uint8")
Bits = numpy.unpackbits(Bytes)
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html

You won't be able to read each bit one by one - you have to read it byte by byte. You can easily extract the bits out, though:
f = open("myfile", 'rb')
# read one byte
byte = f.read(1)
# convert the byte to an integer representation
byte = ord(byte)
# now convert to string of 1s and 0s
byte = bin(byte)[2:].rjust(8, '0')
# now byte contains a string with 0s and 1s
for bit in byte:
    print bit

Joining some of the previous answers I would use:
[int(i) for i in "{0:08b}".format(byte)]
Apply this to each byte read from the file. The result for an example byte 0x88 is:
>>> [int(i) for i in "{0:08b}".format(0x88)]
[1, 0, 0, 0, 1, 0, 0, 0]
You can assign it to a variable and work with it as in your initial request.
The "{0:08b}" format guarantees the full byte length of 8 digits.

To read a byte from a file: bytestring = open(filename, 'rb').read(1). Note: the file is opened in the binary mode.
To get bits, convert the bytestring into an integer: byte = bytestring[0] (Python 3) or byte = ord(bytestring[0]) (Python 2) and extract the desired bit: (byte >> i) & 1:
>>> for i in range(8): (b'a'[0] >> i) & 1
...
1
0
0
0
0
1
1
0
>>> bin(b'a'[0])
'0b1100001'

There are two possible ways to return the i-th bit of a byte. The "first bit" could refer to the high-order bit or it could refer to the lower order bit.
Here is a function that takes a string and index as parameters and returns the value of the bit at that location. As written, it treats the low-order bit as the first bit. If you want the high order bit first, just uncomment the indicated line.
def bit_from_string(string, index):
    i, j = divmod(index, 8)
    # Uncomment this if you want the high-order bit first
    # j = 7 - j
    if ord(string[i]) & (1 << j):
        return 1
    else:
        return 0
The indexing starts at 0. If you want the indexing to start at 1, you can adjust index in the function before calling divmod.
Example usage:
>>> for i in range(8):
...     print i, bit_from_string('\x04', i)
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 0
Now, for how it works:
A string is composed of 8-bit bytes, so first we use divmod() to break the index into two parts:
i: the index of the correct byte within the string
j: the index of the correct bit within that byte
We use the ord() function to convert the character at string[i] into an integer type. Then, (1 << j) computes the value of the j-th bit by left-shifting 1 by j. Finally, we use bitwise-and to test if that bit is set. If so return 1, otherwise return 0.
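The steps above can be sketched for Python 3, where indexing a bytes object already returns an integer, so ord() is not needed. The msb_first parameter is a hypothetical addition replacing the commented-out line. It uses 7 - j, since bit positions within a byte run from 0 to 7.

```python
def bit_from_bytes(data: bytes, index: int, msb_first: bool = False) -> int:
    """Return the bit at a 0-based index; bit 0 is the low-order bit unless msb_first."""
    i, j = divmod(index, 8)   # i: byte within data, j: bit within that byte
    if msb_first:
        j = 7 - j
    return (data[i] >> j) & 1

print([bit_from_bytes(b'\x04', n) for n in range(8)])
# [0, 0, 1, 0, 0, 0, 0, 0]
print([bit_from_bytes(b'\x04', n, msb_first=True) for n in range(8)])
# [0, 0, 0, 0, 0, 1, 0, 0]
```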

Supposing you have a file called bloom_filter.bin which contains an array of bits and you want to read the entire file and use those bits in an array.
First create the array where the bits will be stored after reading,
from bitarray import bitarray
a=bitarray(size) #same as the number of bits in the file
Open the file,
using open or with, anything is fine...I am sticking with open here,
f=open('bloom_filter.bin','rb')
Now load all the bits into the array 'a' at one shot using,
f.readinto(a)
'a' is now a bitarray containing all the bits

This is pretty fast I would think:
import itertools
data = range(10)
to_bits = "{:0>8b}".format
newdata = (n != '0' for n in itertools.chain.from_iterable(map(to_bits, data)))
print(list(newdata)) # prints a long list of True and False values

I think this is a more pythonic way:
a = 140
binary = format(a, 'b')
The result of this block is:
'10001100'
I needed to extract the bit planes of an image, and format() helped me write the following block:
import numpy as np
def img2bitmap(img: np.ndarray) -> list:
    if img.dtype != np.uint8 or img.ndim > 2:
        raise ValueError("Image is not uint8 or gray")
    bit_mat = [np.zeros(img.shape, dtype=np.uint8) for _ in range(8)]
    for row_number in range(img.shape[0]):
        for column_number in range(img.shape[1]):
            binary = format(img[row_number][column_number], 'b')
            for idx, bit in enumerate("".join(reversed(binary))[:]):
                bit_mat[idx][row_number, column_number] = 2 ** idx if int(bit) == 1 else 0
    return bit_mat
With the following block I was also able to rebuild the original image from the extracted bit planes:
import cv2
img = cv2.imread('test.jpg', cv2.IMREAD_GRAYSCALE)
out = img2bitmap(img)
original_image = np.zeros(img.shape, dtype=np.uint8)
for i in range(original_image.shape[0]):
    for j in range(original_image.shape[1]):
        for data in range(8):
            x = np.array([original_image[i, j]], dtype=np.uint8)
            data = np.array([data], dtype=np.uint8)
            flag = np.array([0 if out[data[0]][i, j] == 0 else 1], dtype=np.uint8)
            mask = flag << data[0]
            x[0] = (x[0] & ~mask) | ((flag[0] << data[0]) & mask)
            original_image[i, j] = x[0]
