Assuming I have some ASCII characters in a string, let's say s = ABC, how can I retrieve the binary representation as a string?
In this case,
A = '01000001'
B = '01000010'
C = '01000011'
so I want something like make_binary('ABC') to return '010000010100001001000011'
I know I can get the hex values for a string. I know I can get the binary representation of an integer. I don't know if there's any way to tie all these pieces together.
Use the ord() function to get the integer encoding of each character.
def make_binary(s):
    return "".join([format(ord(c), '08b') for c in s])

print(make_binary("ABC"))
The '08b' format spec renders the number in binary, zero-padded to 8 digits.
I think the other answer is wrong, though I may be misreading the question.
In any case, I think you are asking for the bit representation. "Binary" often refers to a byte representation (.bin files, etc.).
The byte representation is determined by an encoding, so you should encode the string to get a bytes object. That is your binary (byte) representation.
But you seem to be asking for the bit representation, which is different (and the other answer is, IMHO, wrong). You can convert the bytes into bits, as in the other answer, but note that you are converting bytes, not characters. The other answer fails for any character above 127: it formats the code point as a single number instead of formatting the bytes that the encoding produces.
So:
def make_binary(s):
    return "".join(format(c, '08b') for c in s.encode('utf-8'))
and the test (which fails on @Barmar's answer):
>>> print(make_binary("ABC"))
010000010100001001000011
>>> print(make_binary("Á"))
1100001110000001
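Going back from bits to text can be sketched in the same way. This is only an illustration: from_binary is a hypothetical helper name that does not appear in either answer, and it assumes the same encoding (UTF-8 by default) is used in both directions.

```python
def make_binary(s, encoding="utf-8"):
    """Return the bits of the encoded string, 8 per byte."""
    return "".join(format(byte, "08b") for byte in s.encode(encoding))


def from_binary(bits, encoding="utf-8"):
    """Hypothetical inverse: split into 8-bit groups, then decode."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode(encoding)


print(make_binary("ABC"))
print(from_binary(make_binary("ABC")))
# A two-byte character survives the round trip as well.
print(from_binary(make_binary("Á")) == "Á")
```

Because the bits are produced per byte rather than per character, the output length is always a multiple of 8, which is what makes the split in from_binary safe.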
I need to do some work with bytes in Python and I've come across a byte string I don't really understand:
b"H\x00\x84\xffQ\x00\xa6\xff+\x00\x96\xff\xc2\xffI\xff\xa5\xff'\xff\x8a\xff\x19\xff\x19\xff\xf6\xfe\xb0\xfe\xc7\xfeJ\xfel\xfe\xf8\xfd+\xfe\xef\xfd:\xfe\xc3\xfd*\xfe_\xfd\xdf\xfd\n\xfd\xa3\xfd\xc6\xfcq\xfd\xbd\xfc?\xfd"
As far as I know, bytes should be represented as \xhh, where hh are two hexadecimal digits (0-9, a-f). However, the third segment is \xffQ, and farther on there are other characters that shouldn't appear: I, ', *, :, ?, etc.
I used the hex() method to see what the outcome would be, and I got this:
480084ff5100a6ff2b0096ffc2ff49ffa5ff27ff8aff19ff19fff6feb0fec7fe4afe6cfef8fd2bfeeffd3afec3fd2afe5ffddffd0afda3fdc6fc71fdbdfc3ffd
As you can see, some parts of the hex are the same, but e.g. \xffQ was changed into ff51. I need to append some data to this byte string, so I'd like to know what's going on there (or how to get the same result).
Both repr and str, when processing a bytes object, show every byte that corresponds to a printable ASCII character as that character. All other bytes are shown as hexadecimal escapes of the form \xNN (or as a short escape such as \n).
It might help you to visualise the content if you print it as all hexadecimal as follows:
b = b"H\x00\x84\xffQ\x00\xa6\xff+\x00\x96\xff\xc2\xffI\xff\xa5\xff'\xff\x8a\xff\x19\xff\x19\xff\xf6\xfe\xb0\xfe\xc7\xfeJ\xfel\xfe\xf8\xfd+\xfe\xef\xfd:\xfe\xc3\xfd*\xfe_\xfd\xdf\xfd\n\xfd\xa3\xfd\xc6\xfcq\xfd\xbd\xfc?\xfd"
print(''.join(hex(b_) for b_ in b))
Output:
0x480x00x840xff0x510x00xa60xff0x2b0x00x960xff0xc20xff0x490xff0xa50xff0x270xff0x8a0xff0x190xff0x190xff0xf60xfe0xb00xfe0xc70xfe0x4a0xfe0x6c0xfe0xf80xfd0x2b0xfe0xef0xfd0x3a0xfe0xc30xfd0x2a0xfe0x5f0xfd0xdf0xfd0xa0xfd0xa30xfd0xc60xfc0x710xfd0xbd0xfc0x3f0xfd
Or you can use the binascii module if you want to visualize the content:
import binascii
print(binascii.hexlify(b"hello world"))
Output:
b'68656c6c6f20776f726c64'
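A quick check that the mixed repr and the all-hex form describe the same bytes may help. The shortened data value and the appended bytes below are made up for illustration and are not from the question.

```python
data = b"H\x00\x84\xffQ\x00\xa6\xff+"

# Two hex digits per byte, separated by spaces for readability.
hex_view = " ".join(format(byte, "02x") for byte in data)
print(hex_view)

# b"Q" and b"\x51" are two spellings of the same single byte.
print(b"Q" == b"\x51")

# Appending works the same way whichever spelling is used.
extended = data + b"\x51\x00" + bytes([0x2B, 0xFF])
print(extended)
```

So \xffQ in the repr is simply the two bytes ff and 51, and new data can be appended with + using whichever notation is convenient.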
For example, given an arbitrary string, which could be characters or just random bytes:
string = '\xf0\x9f\xa4\xb1'
I want to output:
b'\xf0\x9f\xa4\xb1'
This seems so simple, but I could not find an answer anywhere. Of course, just typing b in front of the string literal will do, but I want to do this at runtime, or from a variable containing the string of bytes.
If the given string were AAAA or some other known characters, I could simply do string.encode('utf-8'), but I expect the string of bytes to be random. Doing that to '\xf0\x9f\xa4\xb1' (random bytes) produces the unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'.
There must be a simpler way to do this?
Edit:
I want to convert the string to bytes without using an encoding
The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
You say you don't want to use an encoding, but the alternatives which avoid it seem far inferior.
The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (up to four bytes) none of which are exactly the input code point as a byte value.
Omitting the argument to .encode() (as in a now-deleted answer) does not avoid an encoding either: in Python 3 it defaults to UTF-8, which has the problem described above.
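The difference between the two encodings can be seen side by side in a short sketch that uses the string from the question (the variable names are only illustrative):

```python
s = "\xf0\x9f\xa4\xb1"

latin1 = s.encode("latin-1")  # one byte per code point, same value
utf8 = s.encode("utf-8")      # two bytes per code point for 0x80-0xff

print(latin1)
print(utf8)

# Latin-1 also round-trips any bytes value back to the same string.
print(latin1.decode("latin-1") == s)
```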
I found a working solution
import struct

def convert_string_to_bytes(string):
    # Pack each code point (which must be 0-255) as one unsigned byte.
    result = b''
    for ch in string:
        result += struct.pack("B", ord(ch))
    return result

string = '\xf0\x9f\xa4\xb1'
print(convert_string_to_bytes(string))
output:
b'\xf0\x9f\xa4\xb1'
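The same per-character conversion can be sketched without struct, since the bytes constructor accepts an iterable of integers in range(256). The name string_to_bytes is hypothetical; the result should match what .encode('latin-1') gives for the same input.

```python
def string_to_bytes(s):
    # bytes() takes an iterable of ints, each of which must be 0-255.
    return bytes(map(ord, s))


print(string_to_bytes("\xf0\x9f\xa4\xb1"))

try:
    string_to_bytes("\u20ac")  # code point 0x20AC does not fit in one byte
except ValueError as exc:
    print("ValueError:", exc)
```

Like the struct version, this only works for code points up to 0xff; anything larger raises an error instead of being silently altered.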
Mongo Change Stream is returned in a binary format
To be able to script a mongo change stream I want to encode the byte array into a format that would be command line parameter safe.
pprint.pprint(change['_id']['_data'])
(b'\x82[8\x92G\x00\x00\x00\x01Fd_id\x00d[8\x91\xf2.\xc2\xd4\x00\x0b\xabO\x98'
b'\x00Z\x10\x04\x16,\x92\xf8\xbf\x92G\x87\x8d1\xff(\x1a\x1b{\xc8\x04')
What would be a safe format for converting the binary array to text that would be accepted as a parameter?
An example of converting from binary to the given format, and from that format as a str back into binary, would be helpful.
Attempt 1
base64.b85encode(change['_id']['_data']).decode('ascii')
'f?GI}M*si-0Y+qBX=DIoTR4&OF2d9R3#(6<09p_P7A%tZzmi9XjWPcy8XJ4a1O'
Going from binary to base85 works, but I can't seem to figure out the way back.
EDIT: Reopening Rationale
I think this question should not be marked as a duplicate, because it targets conversion of arbitrary byte arrays that do not represent human-readable characters in any encoding. The previous question focuses on converting a string into a byte array and back, which is a special case of binary-to-string representation, while my use case calls for a generic solution.
Oh cool, I think I've figured it out
base64.b85decode accepts a str as well as bytes as input.
Example:
b = b'\x82[8\x929\x00\x00\x00\x04Fd'
b == base64.b85decode(base64.b85encode(b).decode('ascii'))
True
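One caveat: the Base85 alphabet includes characters such as &, ( and <, which are visible in the Attempt 1 output and which a shell would interpret unless the parameter is quoted. If that is a concern, URL-safe Base64 is a possible alternative, since it only produces letters, digits, -, _ and =. A minimal sketch, where to_param and from_param are hypothetical helper names and token is a shortened sample value:

```python
import base64


def to_param(data: bytes) -> str:
    """Encode arbitrary bytes as text that needs no shell quoting."""
    return base64.urlsafe_b64encode(data).decode("ascii")


def from_param(text: str) -> bytes:
    """Decode a parameter produced by to_param back into bytes."""
    return base64.urlsafe_b64decode(text)


token = b"\x82[8\x92G\x00\x00\x00\x01Fd_id\x00d[8\x91\xf2."
param = to_param(token)
print(param)
print(from_param(param) == token)
```

The trade-off is size: Base64 output is roughly 33% larger than the input, Base85 roughly 25%.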
I'm trying to convert some binary data I have in Python (a gzipped protocol buffer object) to a hexadecimal string in string-escape fashion (e.g. \xFA\x1C ..).
I have tried both
repr(<mygzipfileobj>.getvalue())
as well as
<mygzipfileobj>.getvalue().encode('string-escape')
In both cases I end up with a string which is not made of HEX chars only.
\x86\xe3$T]\x0fPE\x1c\xaa\x1c8d\xb7\x9e\x127\xcd\x1a.\x88v ...
How can I achieve a consistent hexadecimal conversion where every single byte is actually translated to a \xHH format ? (where H represents a valid hex char 0-9A-F)
The \xhh format you often see is a debugging aid: it is the output of repr() applied to a string with non-ASCII code points. Any ASCII code points are left in place to preserve whatever readable information is there.
If you must have a string with all characters replaced by \xhh escapes, you need to do so manually:
''.join(r'\x{0:02x}'.format(ord(c)) for c in value)
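The line above calls ord() on each element, which fits a Python 2 byte string. For a Python 3 bytes object, iterating already yields integers, so a sketch under that assumption could look like this (escape_all is a hypothetical name, and value is a shortened sample taken from the question's output):

```python
def escape_all(data: bytes) -> str:
    # Iterating over bytes yields ints in Python 3, so no ord() is needed.
    return "".join("\\x{:02x}".format(byte) for byte in data)


value = b"\x86\xe3$T]\x0fPE"
print(escape_all(value))
```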
If you need quotes around that, you'd need to add those manually too:
"'{0}'".format(''.join(r'\x{:02x}'.format(ord(c)) for c in value))
Could you explain in detail what the difference is between byte string and Unicode string in Python. I have read this:
Byte code is simply the converted source code into arrays of bytes
Does it mean that Python has its own coding/encoding format? Or does it use the operating system settings?
I don't understand. Could you please explain?
Thank you!
No, Python does not use its own encoding - it will use any encoding that it has access to and that you specify.
A character in a str represents one Unicode character. A single byte can only take 256 different values, however, so to cover all of Unicode, encodings use more than one byte for many characters.
bytes objects give you access to the underlying bytes. str objects have the encode method, which takes the name of an encoding and returns the bytes object that represents the string in that encoding. bytes objects have the decode method, which takes the name of an encoding and returns the str that results from interpreting the bytes as text in the given encoding.
For example:
>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'
We can see that UTF-8 is using four bytes, \xce, \xb1, \xce, and \xac, to represent two characters.
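To illustrate that the bytes depend on the encoding you specify, here is a small sketch that encodes the same two characters three ways. The choice of encodings is only for illustration.

```python
text = "αά"

for encoding in ("utf-8", "utf-16-le", "iso8859_7"):
    encoded = text.encode(encoding)
    # The byte count varies, but decoding with the same encoding
    # always gives back the original two characters.
    print(encoding, len(encoded), encoded)
    assert encoded.decode(encoding) == text
```

Decoding with a different encoding than the one used for encoding would either raise an error or produce the wrong characters, which is why the encoding has to be known.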
Related reading:
Python Unicode Howto (from the official documentation)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Here's an attempt at a simple explanation that only applies to Python 3. I hope that, coming from a lay person, it helps to clear up some confusion for the completely uninitiated. If there are any technical inaccuracies, please forgive me and feel free to point them out.
Suppose you create a string using Python 3 in the usual way:
stringobject = 'ant'
stringobject would be a unicode string.
A unicode string is made up of unicode characters. In stringobject above, the unicode characters are the individual letters, e.g. a, n, t
Each unicode character is assigned a code point, which can be expressed as a sequence of hex digits (a hex digit can take on 16 values, ranging from 0-9 and A-F). For instance, the letter 'a' is equivalent to '\u0061', and 'ant' is equivalent to '\u0061\u006E\u0074'.
So you will find that if you type in,
stringobject = '\u0061\u006E\u0074'
stringobject
You will also get the output 'ant'.
Now, unicode is converted to bytes, in a process known as encoding. The reverse process of converting bytes to unicode is known as decoding.
How is this done? Since each hex digit can take on 16 different values, it can be reflected in a 4-bit binary sequence (e.g. the hex digit 0 can be expressed in binary as 0000, the hex digit 1 can be expressed as 0001 and so forth). If a unicode character has a code point consisting of four hex digits, it would need a 16-bit binary sequence to encode it.
Different encoding systems specify different rules for converting unicode to bits. Most importantly, encodings differ in the number of bits they use to express each unicode character.
For instance, the ASCII encoding system uses only 7 bits per character, stored in 1 byte each. Thus it can only encode the 128 unicode characters with code points from 0x00 to 0x7F. The UTF-8 encoding system uses 8 to 32 bits (1 to 4 bytes) per character, so it can encode every unicode code point (the highest is 0x10FFFF), i.e. everything.
Running the following code:
byteobject = stringobject.encode('utf-8')
byteobject, type(byteobject)
converts a unicode string into a byte string using the utf-8 encoding system, and returns (b'ant', <class 'bytes'>).
Note that if you used 'ASCII' as the encoding system, you wouldn't run into any problems since all code points in 'ant' can be expressed with 1 byte. But if you had a unicode string containing characters with code points above 0x7F, you would get a UnicodeEncodeError.
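Both points can be checked with a short sketch; the sample characters are arbitrary choices for illustration:

```python
# a, e with acute accent, euro sign, snake emoji
samples = ["a", "\u00e9", "\u20ac", "\U0001f40d"]

for ch in samples:
    size = len(ch.encode("utf-8"))
    print("U+{:04X} takes {} byte(s) in UTF-8".format(ord(ch), size))

try:
    "caf\u00e9".encode("ascii")
except UnicodeEncodeError as exc:
    # The error reports which character could not be encoded and why.
    print("ASCII failed at index", exc.start, "because", exc.reason)
```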
Similarly,
stringobject = byteobject.decode('utf-8')
stringobject, type(stringobject)
gives you ('ant', <class 'str'>).