I need a way to get the binary representation of a string in Python, e.g.
st = "hello world"
toBinary(st)
Is there a module or some neat way of doing this?
Something like this?
>>> st = "hello world"
>>> ' '.join(format(ord(x), 'b') for x in st)
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'
#using `bytearray`
>>> ' '.join(format(x, 'b') for x in bytearray(st, 'utf-8'))
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'
If by binary you mean the bytes type, you can just use the string's encode method, which returns a bytes object using the encoding you pass. Just make sure you pass an appropriate encoding.
In [9]: "hello world".encode('ascii')
Out[9]: b'hello world'
In [10]: byte_obj = "hello world".encode('ascii')
In [11]: byte_obj
Out[11]: b'hello world'
In [12]: byte_obj[0]
Out[12]: 104
Otherwise, if you want the binary representation as zeros and ones, a more Pythonic way is to first convert your string to a byte array and then map the bin function over it:
>>> st = "hello world"
>>> map(bin,bytearray(st))
['0b1101000', '0b1100101', '0b1101100', '0b1101100', '0b1101111', '0b100000', '0b1110111', '0b1101111', '0b1110010', '0b1101100', '0b1100100']
Or you can join it:
>>> ' '.join(map(bin,bytearray(st)))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'
Note that in Python 3 you need to specify an encoding for the bytearray function:
>>> ' '.join(map(bin,bytearray(st,'utf8')))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'
You can also use binascii module in python 2:
>>> import binascii
>>> bin(int(binascii.hexlify(st),16))
'0b110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'
hexlify returns the hexadecimal representation of the binary data; you can then convert it to an int by specifying 16 as the base, and convert that to binary with bin.
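For Python 3, a rough equivalent of the hexlify approach is int.from_bytes. This is only a sketch: UTF-8 is an assumption, and the variable names are illustrative.

```python
st = "hello world"

# Encode to bytes first; UTF-8 is an assumption, pick the encoding you need.
data = st.encode("utf-8")

# Interpret the whole byte sequence as one big-endian integer,
# then render that integer in base 2.
as_int = int.from_bytes(data, "big")
bits = bin(as_int)

# bin() drops leading zero bits, so pad to 8 bits per byte if you
# want to be able to reverse the conversion.
padded = format(as_int, "0{}b".format(len(data) * 8))

# Reverse: bit string -> integer -> bytes -> text.
restored = int(padded, 2).to_bytes(len(data), "big").decode("utf-8")

print(bits)
print(padded)
print(restored)
```

The padded form is the one to keep if you ever need to get the original string back.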
We just need to encode it.
'string'.encode('ascii')
You can access the code values for the characters in your string using the ord() built-in function. If you then need to format this in binary, the built-in format() function (or the str.format() method) will do the job.
a = "test"
print(' '.join(format(ord(x), 'b') for x in a))
(Thanks to Ashwini Chaudhary for posting that code snippet.)
While the above code works in Python 3, this matter gets more complicated if you're assuming any encoding other than UTF-8. In Python 2, strings are byte sequences, and ASCII encoding is assumed by default. In Python 3, strings are assumed to be Unicode, and there's a separate bytes type that acts more like a Python 2 string. If you wish to assume any encoding other than UTF-8, you'll need to specify the encoding.
In Python 3, then, you can do something like this:
a = "test"
a_bytes = bytes(a, "ascii")
print(' '.join(["{0:b}".format(x) for x in a_bytes]))
The differences between UTF-8 and ascii encoding won't be obvious for simple alphanumeric strings, but will become important if you're processing text that includes characters not in the ascii character set.
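To see where the encoding starts to matter, here is a small sketch with one non-ASCII character. The sample word is just an illustration.

```python
text = "caf\u00e9"  # "café", written with an escape so the source stays ASCII

# UTF-8 needs two bytes for the accented letter; Latin-1 needs one.
utf8_bits = " ".join(format(b, "08b") for b in text.encode("utf-8"))
latin1_bits = " ".join(format(b, "08b") for b in text.encode("latin-1"))

# ASCII cannot represent the accented letter at all, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bits)
print(latin1_bits)
print(ascii_ok)
```

The same four-character string yields five bytes in UTF-8 and four in Latin-1, so the "binary representation" depends on which encoding you choose.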
In Python 3.6 and above you can use an f-string to format the result.
str = "hello world"
print(" ".join(f"{ord(i):08b}" for i in str))
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100
The left side of the colon, ord(i), is the actual object whose value
will be formatted and inserted into the output. Using ord() gives you
the base-10 code point for a single str character.
The right hand side of the colon is the format specifier. 08 means
width 8, 0 padded, and the b functions as a sign to output the
resulting number in base 2 (binary).
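A few variations on the format specifier, as a quick sketch. Note that the width is a minimum, so code points above 255 still come out longer than 8 digits.

```python
n = ord("h")  # 104

plain = f"{n:b}"         # no padding
padded = f"{n:08b}"      # zero-padded to a minimum width of 8
prefixed = f"{n:#010b}"  # "#" adds the 0b prefix, which counts toward the width

# A character outside the 0-255 range needs more than 8 bits.
euro = "\u20ac"
wide = f"{ord(euro):08b}"

print(plain, padded, prefixed, wide)
```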
def method_a(sample_string):
    binary = ' '.join(format(ord(x), 'b') for x in sample_string)

def method_b(sample_string):
    binary = ' '.join(map(bin,bytearray(sample_string,encoding='utf-8')))

if __name__ == '__main__':
    from timeit import timeit
    sample_string = 'Convert this ascii strong to binary.'
    print(
        timeit(f'method_a("{sample_string}")',setup='from __main__ import method_a'),
        timeit(f'method_b("{sample_string}")',setup='from __main__ import method_b')
    )
# 9.564299999998184 2.943955828988692
method_b is substantially more efficient at converting to a byte array because it makes low level function calls instead of manually transforming every character to an integer, and then converting that integer into its binary value.
This is an update to the existing answers that use bytearray(), which no longer work that way in Python 3:
>>> st = "hello world"
>>> map(bin, bytearray(st))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding
Because if the source is a string, you must also give the encoding:
>>> map(bin, bytearray(st, encoding='utf-8'))
<map object at 0x7f14dfb1ff28>
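In Python 3, map returns a lazy iterator, so you have to consume it to see the values. A short sketch:

```python
st = "hello world"

# map() is lazy in Python 3; wrap it in list() to materialize the results.
lazy = map(bin, bytearray(st, encoding="utf-8"))
as_list = list(lazy)

# str.join consumes the iterator directly, so no list() is needed here.
joined = " ".join(map(bin, st.encode("utf-8")))

print(as_list)
print(joined)
```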
''.join(format(i, 'b') for i in bytearray(str, encoding='utf-8'))
This keeps each byte at its natural width: no zeros are added to pad it
to 8 bits, so there are no padding zeros to strip afterwards. Note, though,
that with variable-width groups joined without a separator, the byte
boundaries are no longer recoverable from the output alone.
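If you need to get the string back, a fixed width of 8 bits per byte makes the reverse step simple. A minimal sketch, assuming UTF-8; to_bits and from_bits are hypothetical helper names, not library functions.

```python
def to_bits(text, encoding="utf-8"):
    """Return the text as one bit string, 8 bits per encoded byte."""
    return "".join(format(b, "08b") for b in text.encode(encoding))


def from_bits(bits, encoding="utf-8"):
    """Reverse to_bits: split into 8-bit chunks and decode the bytes."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode(encoding)


encoded = to_bits("hello world")
print(encoded)
print(from_bits(encoded))
```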
a = list(input("Enter a string\t: "))
def fun(a):
    c =' '.join(['0'*(8-len(bin(ord(i))[2:]))+(bin(ord(i))[2:]) for i in a])
    return c
print(fun(a))
Is there a good way to encode strings to utf-8, but in octal format instead of the default hexadecimal?
For example:
>>> "õ".encode("utf-8")
b'\xc3\xb5'
Here the output is hex, not octal. The output in octal would be: b'\303\265'
Python 3 automatically handles the decoding just fine:
>>> b"\xc3\xb5".decode("utf-8")
'õ'
>>> b'\303\265'.decode("utf-8")
'õ'
Is there a codec or option I'm missing? I'd like to avoid a lot of manual string manipulation.
update: I had misunderstood -- there is no difference between b"\xc3\xb5" and b'\303\265' at all, rather they are just 2 different ways to display the same underlying byte code. In fact:
>>> b"\xc3\xb5" == b'\303\265'
True
Here's a class that overrides the representation of the string it wraps:
>>> class OctUTF8:
...     def __init__(self,s):
...         self.s = s.encode()
...     def __repr__(self):
...         return "b'" + ''.join(f'\\{n:03o}' for n in self.s) + "'"
...
>>> s='õ'
>>> OctUTF8(s)
b'\303\265'
This representation can be evaluated as a byte string and decoded back to the original:
>>> eval(repr(OctUTF8(s))).decode()
'õ'
First, you can use ord() to convert a character in a string to its Unicode code point; then you can use oct():
print(oct(ord("õ")))
Output:
0o365
You can convert each byte in a bytes object to its octal representation:
[oct(b) for b in "õ".encode("utf-8")]
Gives
['0o303', '0o265']
You can then manipulate the results to produce your desired output.
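One way that manipulation could look, as a sketch: format each byte as three octal digits with a leading backslash. The variable names are illustrative.

```python
data = "\u00f5".encode("utf-8")  # "õ" encodes to b'\xc3\xb5'

# Three octal digits per byte, each preceded by a literal backslash.
octal_escaped = "".join("\\{:03o}".format(b) for b in data)

print(octal_escaped)  # \303\265
```

This only changes how the bytes are displayed; the underlying bytes object is the same either way.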
I want to encode string to bytes.
To convert to bytes, I used bytes.fromhex()
>>> bytes.fromhex('7403073845')
b't\x03\x078E'
But it displayed some characters.
How can it be displayed as hex, like the following?
b't\x03\x078E' => '\x74\x03\x07\x38\x45'
I want to encode string to bytes.
bytes.fromhex() already transforms your hex string into bytes. Don't confuse an object and its text representation -- REPL uses sys.displayhook that uses repr() to display bytes in ascii printable range as the corresponding characters but it doesn't affect the value in any way:
>>> b't' == b'\x74'
True
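A short sketch that separates the value from its display: the bytes object holds five integers, and repr() merely chooses to show printable ones as characters.

```python
data = bytes.fromhex("7403073845")

# The actual contents: one integer per byte.
values = list(data)

# What the REPL shows: printable bytes appear as characters, others as \x escapes.
shown = repr(data)

print(values)
print(shown)
print(data == b"\x74\x03\x07\x38\x45")
```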
Print bytes to hex
To convert bytes back into a hex string, you can use the bytes.hex method (available since Python 3.5):
>>> b't\x03\x078E'.hex()
'7403073845'
On older Python versions you could use binascii.hexlify():
>>> import binascii
>>> binascii.hexlify(b't\x03\x078E').decode('ascii')
'7403073845'
How can it be displayed as hex like following? b't\x03\x078E' => '\x74\x03\x07\x38\x45'
>>> print(''.join(['\\x%02x' % b for b in b't\x03\x078E']))
\x74\x03\x07\x38\x45
The Python repr can't be changed. If you want to do something like this, you'd need to do it yourself; bytes objects are trying to minimize spew, not format output for you.
If you want to print it like that, you can do:
from itertools import repeat
hexstring = '7403073845'
# Makes the individual \x## strings using iter reuse trick to pair up
# hex characters, and prefixing with \x as it goes
escapecodes = map(''.join, zip(repeat(r'\x'), *[iter(hexstring)]*2))
# Print them all with quotes around them (or omit the quotes, your choice)
print("'", *escapecodes, "'", sep='')
Output is exactly as you requested:
'\x74\x03\x07\x38\x45'
I'm reading a WAV audio file in Python using the wave module. The readframes() function in this library returns frames as a hex string. I want to remove the \x from this string, but the translate() function doesn't work as I want:
>>> input = wave.open(r"G:\Workspace\wav\1.wav",'r')
>>> input.readframes (1)
'\xff\x1f\x00\xe8'
>>> '\xff\x1f\x00\xe8'.translate(None,'\\x')
'\xff\x1f\x00\xe8'
>>> '\xff\x1f\x00\xe8'.translate(None,'\x')
ValueError: invalid \x escape
>>> '\xff\x1f\x00\xe8'.translate(None,r'\x')
'\xff\x1f\x00\xe8'
>>>
Anyway, I want to divide the result values by 2, then add \x again and generate a new WAV file containing these new values. Does anyone have a better idea?
What's wrong?
Indeed, you don't have backslashes in your string, which is why you can't remove them.
If you inspect each character of this string (using the ord() and len() functions), you'll see their real values. Besides, the length of your string is just 4, not 16.
You can play with several solutions to achieve your result:
'hex' encode:
'\xff\x1f\x00\xe8'.encode('hex')
'ff1f00e8'
Or use repr() function:
repr('\xff\x1f\x00\xe8').translate(None,r'\\x')
One way to do what you want is:
>>> s = '\xff\x1f\x00\xe8'
>>> ''.join('%02x' % ord(c) for c in s)
'ff1f00e8'
The reason why translate is not working is that what you are seeing is not the string itself, but its representation. In other words, \x is not contained in the string:
>>> '\\x' in '\xff\x1f\x00\xe8'
False
\xff, \x1f, \x00 and \xe8 are the hexadecimal representations of four characters (in fact, len(s) == 4, not 16).
Use the encode method:
>>> s = '\xff\x1f\x00\xe8'
>>> print s.encode("hex")
ff1f00e8
As this is a hexadecimal representation, encode with hex
>>> '\xff\x1f\x00\xe8'.encode('hex')
'ff1f00e8'
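For the underlying goal in the question (halving the sample values), going through hex text isn't needed at all. Below is a Python 3 sketch using struct. It assumes 16-bit signed little-endian samples, which you would want to confirm with getsampwidth() on the opened file; halve_samples is a hypothetical helper name.

```python
import struct


def halve_samples(frames):
    """Halve every 16-bit signed little-endian sample in a raw frame buffer.

    Assumes len(frames) is a multiple of 2 (sample width of 2 bytes).
    """
    count = len(frames) // 2
    fmt = "<{}h".format(count)
    samples = struct.unpack(fmt, frames)
    # Floor division keeps the values as integers within the 16-bit range.
    halved = [s // 2 for s in samples]
    return struct.pack(fmt, *halved)


frames = b"\xff\x1f\x00\xe8"  # the four bytes from the question
quieter = halve_samples(frames)

print(struct.unpack("<2h", frames))
print(struct.unpack("<2h", quieter))
```

The resulting bytes could then be passed to writeframes() on a wave file opened for writing.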
I have this string: Hello, World! and I want to print it using Python as '48:65:6c:6c:6f:2c:20:57:6f:72:6c:64:21'.
hex() works only for integers.
How can it be done?
You can transform your string to an integer generator. Apply hexadecimal formatting for each element and intercalate with a separator:
>>> s = "Hello, World!"
>>> ":".join("{:02x}".format(ord(c)) for c in s)
'48:65:6c:6c:6f:2c:20:57:6f:72:6c:64:21'
':'.join(x.encode('hex') for x in 'Hello, World!')
For Python 2.x:
':'.join(x.encode('hex') for x in 'Hello, World!')
The code above will not work with Python 3.x. For 3.x, the code below will work:
':'.join(hex(ord(x))[2:] for x in 'Hello, World!')
Another answer in two lines that some might find easier to read, and helps with debugging line breaks or other odd characters in a string:
For Python 2.7
for character in string:
    print character, character.encode('hex')
For Python 3.7 (not tested on all releases of 3)
for character in string:
    print(character, character.encode('utf-8').hex())
Some complements to Fedor Gogolev's answer:
First, if the string contains characters whose code is below 16 (0x10), they will be displayed as a single hex digit rather than two. In that case, the correct format should be {:02x}:
>>> s = "Hello Unicode \u0005!!"
>>> ":".join("{0:x}".format(ord(c)) for c in s)
'48:65:6c:6c:6f:20:75:6e:69:63:6f:64:65:20:5:21:21'
                                           ^
>>> ":".join("{:02x}".format(ord(c)) for c in s)
'48:65:6c:6c:6f:20:75:6e:69:63:6f:64:65:20:05:21:21'
                                           ^^
Second, if your "string" is in reality a "byte string" -- and since the difference matters in Python 3 -- you might prefer the following:
>>> s = b"Hello bytes \x05!!"
>>> ":".join("{:02x}".format(c) for c in s)
'48:65:6c:6c:6f:20:62:79:74:65:73:20:05:21:21'
Please note there is no need for conversion in the above code as a bytes object is defined as "an immutable sequence of integers in the range 0 <= x < 256".
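On Python 3.8 and later, bytes.hex() also accepts a separator, which covers the colon-delimited case without a join. A small sketch, assuming UTF-8 for the str case:

```python
s = "Hello, World!"

# bytes.hex(sep) is available from Python 3.8; it inserts sep between bytes.
hex_colon = s.encode("utf-8").hex(":")

# Equivalent spelled-out version that also works on earlier Python 3 releases.
hex_join = ":".join("{:02x}".format(b) for b in s.encode("utf-8"))

print(hex_colon)
print(hex_colon == hex_join)
```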
Print a string as hex bytes?
The accepted answer gives:
s = "Hello world !!"
":".join("{:02x}".format(ord(c)) for c in s)
returns:
'48:65:6c:6c:6f:20:77:6f:72:6c:64:20:21:21'
The accepted answer works only so long as you use bytes (mostly ascii characters). But if you use unicode, e.g.:
a_string = u"Привет мир!!" # "Prevyet mir", or "Hello World" in Russian.
You need to convert to bytes somehow.
If your terminal doesn't accept these characters, you can decode from UTF-8 or use the names (so you can paste and run the code along with me):
a_string = (
"\N{CYRILLIC CAPITAL LETTER PE}"
"\N{CYRILLIC SMALL LETTER ER}"
"\N{CYRILLIC SMALL LETTER I}"
"\N{CYRILLIC SMALL LETTER VE}"
"\N{CYRILLIC SMALL LETTER IE}"
"\N{CYRILLIC SMALL LETTER TE}"
"\N{SPACE}"
"\N{CYRILLIC SMALL LETTER EM}"
"\N{CYRILLIC SMALL LETTER I}"
"\N{CYRILLIC SMALL LETTER ER}"
"\N{EXCLAMATION MARK}"
"\N{EXCLAMATION MARK}"
)
So we see that:
":".join("{:02x}".format(ord(c)) for c in a_string)
returns
'41f:440:438:432:435:442:20:43c:438:440:21:21'
a poor/unexpected result - these are the code points that combine to make the graphemes we see in Unicode, from the Unicode Consortium - representing languages all over the world. This is not how we actually store this information so it can be interpreted by other sources, though.
To allow another source to use this data, we would usually need to convert to UTF-8 encoding, for example, to save this string in bytes to disk or to publish to html. So we need that encoding to convert the code points to the code units of UTF-8 - in Python 3, ord is not needed because bytes are iterables of integers:
>>> ":".join("{:02x}".format(c) for c in a_string.encode('utf-8'))
'd0:9f:d1:80:d0:b8:d0:b2:d0:b5:d1:82:20:d0:bc:d0:b8:d1:80:21:21'
Or perhaps more elegantly, using the new f-strings (only available in Python 3):
>>> ":".join(f'{c:02x}' for c in a_string.encode('utf-8'))
'd0:9f:d1:80:d0:b8:d0:b2:d0:b5:d1:82:20:d0:bc:d0:b8:d1:80:21:21'
In Python 2, pass c to ord first, i.e. ord(c) - more examples:
>>> ":".join("{:02x}".format(ord(c)) for c in a_string.encode('utf-8'))
'd0:9f:d1:80:d0:b8:d0:b2:d0:b5:d1:82:20:d0:bc:d0:b8:d1:80:21:21'
>>> ":".join(format(ord(c), '02x') for c in a_string.encode('utf-8'))
'd0:9f:d1:80:d0:b8:d0:b2:d0:b5:d1:82:20:d0:bc:d0:b8:d1:80:21:21'
You can use hexdump's dump function:
import hexdump
hexdump.dump("Hello, World!", sep=":")
(append .lower() if you require lower-case). This works for both Python 2 and 3.
Using map and lambda function can produce a list of hex values, which can be printed (or used for other purposes)
>>> s = 'Hello 1 2 3 \x01\x02\x03 :)'
>>> map(lambda c: hex(ord(c)), s)
['0x48', '0x65', '0x6c', '0x6c', '0x6f', '0x20', '0x31', '0x20', '0x32', '0x20', '0x33', '0x20', '0x1', '0x2', '0x3', '0x20', '0x3a', '0x29']
A bit more general for those who don't care about Python 3 or colons:
from codecs import encode
data = open('/dev/urandom', 'rb').read(20)
print(encode(data, 'hex')) # Data
print(encode(b"hello", 'hex')) # String
This can be done in the following ways:
from __future__ import print_function
str = "Hello, World!"
for char in str:
    mm = int(char.encode('hex'), 16)
    print(hex(mm), sep=':', end=' ')
The output of this will be in hexadecimal as follows:
0x48 0x65 0x6c 0x6c 0x6f 0x2c 0x20 0x57 0x6f 0x72 0x6c 0x64 0x21
For something that offers more performance than ''.format(), you can use this:
>>> ':'.join( '%02x'%(v if type(v) is int else ord(v)) for v in 'Hello, World!' )
'48:65:6c:6c:6f:2c:20:57:6f:72:6c:64:21'
>>>
>>> ':'.join( '%02x'%(v if type(v) is int else ord(v)) for v in b'Hello, World!' )
'48:65:6c:6c:6f:2c:20:57:6f:72:6c:64:21'
>>>
I am sorry this couldn't look nicer.
It would be nice if one could simply do '%02x' % v, but that only takes an int,
so without the logic to select ord(v) you would be limited to byte strings (b'').
With f-string:
"".join(f"{ord(c):x}" for c in "Hello")
Use any delimiter:
>>> "⚡".join(f"{ord(c):x}" for c in "Hello")
'48⚡65⚡6c⚡6c⚡6f'
Just for convenience, very simple.
def hexlify_byteString(byteString, delim="%"):
    ''' Very simple way to hexlify a byte string using delimiters '''
    retval = ""
    for intval in byteString:
        retval += ('0123456789ABCDEF'[int(intval / 16)])
        retval += ('0123456789ABCDEF'[int(intval % 16)])
        retval += delim
    return(retval[:-1])
hexlify_byteString(b'Hello, World!', ":")
# Out[439]: '48:65:6C:6C:6F:2C:20:57:6F:72:6C:64:21'
Python 2.x has chr(), which converts a number in the range 0-255 to a byte string with one character with that numeric value, and unichr(), which converts a number in the range 0-0x10FFFF to a Unicode string with one character with that Unicode codepoint. Python 3.x replaces unichr() with chr(), in keeping with its "Unicode strings are default" policy, but I can't find anything that does exactly what the old chr() did. The 2to3 utility (from 2.6) leaves chr calls alone, which is not right in general :(
(This is for parsing and serializing a file format which is explicitly defined in terms of 8-bit bytes.)
Try the following:
b = bytes([x])
For example:
>>> bytes([255])
b'\xff'
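One pitfall worth spelling out: the list brackets matter. Passing a bare integer to bytes() creates that many zero bytes instead. A minimal sketch, where bchr is a hypothetical helper name:

```python
def bchr(x):
    """Return a single byte with value x (0-255), like Python 2's chr()."""
    return bytes([x])


single = bchr(255)

# Without the list, the integer is treated as a length, not a value.
zeros = bytes(3)

# Values outside 0-255 are rejected.
try:
    bchr(256)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(single, zeros, out_of_range_rejected)
```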
Consider using bytearray((255,)) which works the same in Python2 and Python3. In both Python generations the resulting bytearray-object can be converted to a bytes(obj) which is an alias for a str() in Python2 and real bytes() in Python3.
# Python2
>>> x = bytearray((32,33))
>>> x
bytearray(b' !')
>>> bytes(x)
' !'
# Python3
>>> x = bytearray((32,33))
>>> x
bytearray(b' !')
>>> bytes(x)
b' !'
In case you want to write Python 2/3 compatible code, use six.int2byte
Yet another alternative (Python 3.5+):
>>> b'%c' % 65
b'A'
>>> import struct
>>> struct.pack('B', 10)
b'\n'
>>> import functools
>>> bchr = functools.partial(struct.pack, 'B')
>>> bchr(10)
b'\n'
simple replacement based on small range memoization (should work on 2 and 3), good performance on CPython and pypy
binchr = tuple([bytes(bytearray((b,))) for b in range(256)]).__getitem__
binchr(1) -> b'\x01'